CSCI-S278: Applied Quantitative Finance and Machine Learning¶

Final Project¶

Harvard University
Summer 2024
Instructor: Dr Mark Antonio Awada
Submitted by: Bethun Bhowmik

Import required packages¶

Abstract¶

  • 30 stocks based on their historic performance and market cap in the Technology, Finance and Energy sectors for analysis
  • From the initial basket, 15 stocks with mean returns greater than 10% were filtered out as per my investment criteria and further analyzed
  • Modelling was conducted on the filtered basket using Multiple Linear Regression (MLR) and Support Vector Regression (SVR)
  • 9 pairs of stocks were identified as cointegrated suggesting potential trading opportunities where the price movements of those stocks may be predictably related over the long term
  • The maximum sharpe ratio portfolio had a sharpe ratio of 0.76, annualized return of 29% and annualized volatity of 22% with a significant allocation to high-growth stocks and some bond ETFs to balance risk
  • To further reduce sharpe ratio and volatility, the Efficient Frontier was analyzed. While the annualized return dropped from 29% to 11%, the sharpe ratio has also reduced from 0.76 to 0.64 and the annualized volatity drastically reduced from 22% to 7%
  • The constructed portfolio portfolio has a VaR return of 1.66% and a CVaR return of 2.18%, which is a good indicator of minimized portfolio risk
  • The average pairwise correlation of final constructed portfolio is 0.0118 indicating very low correlation and diversification
  • The hedged portfolio PnL is promising and above investment criteria defined
  • The Beta of the technology stocks basket is 1.3582, beta of the finance stocks basket is 1.1040; and the overall portfolio beta is 1.2492

Topics Explored¶

  • Data Preprocessing
  • Summary statistics and graphical analysis
  • Stock Returns
  • Stationarity Testing - ADF Test
  • Feature Engineering
  • Lasso Regression
  • Principal Component Analysis
  • Sharpe Ratio
  • Modeling
    • Simple Linear Regression (SLR)
    • Multiple Linear Regression (MLR)
    • Support Vector Regression (SVR)
  • Cointegration
  • Portfolio Asset Allocation
  • Efficient Frontier
  • Diversification Factor
  • Value at Risk; Conditional Value at Risk
  • Volatility, Hedged Portfolio PnL

Introduction¶

I shortlisted 30 stocks based on their historic performance and market cap in the Technology, Finance and Energy sectors for analysis by downloading their data from 2019-01-01. The stocks with mean returns greater than 1.1 are filtered out as per my investment criteria and further analyzed for portfolio creation

Tech Stocks:

  • AAPL - Apple Inc.
  • MSFT - Microsoft Corporation
  • NVDA - NVIDIA Corporation
  • AVGO - Broadcom Inc.
  • ORCL - Oracle Corporation
  • CRM - Salesforce, Inc.
  • ADBE - Adobe Inc.
  • AMD - Advanced Micro Devices, Inc.
  • ACN - Accenture plc
  • CSCO - Cisco Systems, Inc.

Finance Stocks:

  • BRK-B - Berkshire Hathaway Inc. (Class B)
  • JPM - JPMorgan Chase & Co.
  • V - Visa Inc.
  • MA - Mastercard Incorporated
  • BAC - Bank of America Corporation
  • SBBG - Silvergate Capital Corporation
  • WFC - Wells Fargo & Company
  • GS - The Goldman Sachs Group, Inc.
  • MS - Morgan Stanley
  • AXP - American Express Company

Energy Stocks:

  • XOM - Exxon Mobil Corporation
  • CVX - Chevron Corporation
  • COP - ConocoPhillips
  • EOG - EOG Resources, Inc.
  • SLB - SLB (formerly Schlumberger Limited)
  • EPD - Enterprise Products Partners L.P.
  • MPC - Marathon Petroleum Corporation
  • PSX - Phillips 66
  • OXY - Occidental Petroleum Corporation
  • ET - Energy Transfer LP

Import neccessary libraries and modules¶

We import the neccessary libraries and modules required for the project.

InĀ [1214]:
import yfinance as yf
import pandas as pd
import numpy as np
from matplotlib import pyplot as plt
import seaborn as sns
from sklearn.preprocessing import StandardScaler, Normalizer
from statsmodels.tsa.stattools import adfuller
from pandas.plotting import lag_plot
from statsmodels.tsa.stattools import adfuller
from statsmodels.tsa.seasonal import seasonal_decompose
from random import seed
from random import random
from scipy.stats import norm
from statsmodels.graphics.tsaplots import plot_acf
import pandas.plotting as pd_plotting
from IPython.display import display
from scipy.cluster import hierarchy
from scipy.spatial import distance
from sklearn.linear_model import Lasso
from sklearn.model_selection import train_test_split
from sklearn.decomposition import PCA
from statsmodels import regression, stats
import statsmodels.api as sm
from scipy.cluster.hierarchy import dendrogram, linkage
import pylab
from sklearn.impute import SimpleImputer
from sklearn.svm import SVR
from sklearn.svm import SVC
from sklearn.metrics import r2_score, mean_absolute_error, mean_squared_error
from sklearn.pipeline import make_pipeline
from statsmodels.tsa.stattools import coint, adfuller
from sklearn.model_selection import TimeSeriesSplit , cross_val_score
from sklearn.cluster import KMeans, DBSCAN
from sklearn.metrics import accuracy_score, confusion_matrix, precision_score
from sklearn.manifold import TSNE
from matplotlib import cm
import scipy.optimize as sco
from sklearn.ensemble import RandomForestClassifier
from sklearn.covariance import LedoitWolf
from statsmodels.stats.stattools import jarque_bera
import warnings
warnings.filterwarnings("ignore")

Data Preprocessing¶

The data preprocessing workflow involved retrieving and combining stock tickers from the technology, finance, and energy sectors, followed by downloading their key statistics using the yfinance library. The data was then cleaned by converting all feature columns to numeric values, with non-numeric entries coerced to NaN, and missing values were imputed using the mean of each feature. The feature data was subsequently normalized to ensure consistency across all variables, with each feature scaled to have a mean of 0 and a standard deviation of 1. Finally, the original key statistics were saved to a CSV file, and a new DataFrame with the normalized data was created and inspected, ensuring the dataset was prepared for analysis.

InĀ [1038]:
tech_stocks = ['AAPL', 'MSFT', 'NVDA', 'AVGO', 'ORCL', 'CRM', 'ADBE', 'AMD', 'ACN', 'CSCO']

finance_stocks = ['BRK-B', 'JPM', 'V', 'MA', 'BAC', 'SBBG', 'WFC', 'GS', 'MS', 'AXP']

energy_stocks = ['XOM', 'CVX', 'COP', 'EOG', 'SLB', 'EPD', 'MPC', 'PSX', 'OXY', 'ET']

# Combine all tickers into a single list
all_tickers = tech_stocks + finance_stocks + energy_stocks

stock_data = yf.download(all_tickers, start="2019-01-01")
print(stock_data.head())
[*********************100%%**********************]  30 of 30 completed
Price       Adj Close                                                \
Ticker           AAPL         ACN        ADBE        AMD       AVGO   
Date                                                                  
2019-01-02  37.793777  129.528809  224.570007  18.830000  21.319847   
2019-01-03  34.029247  125.106438  215.699997  17.049999  19.423426   
2019-01-04  35.481922  129.971008  226.190002  19.000000  19.614336   
2019-01-07  35.402954  130.422394  229.259995  20.570000  20.013796   
2019-01-08  36.077839  133.720764  232.679993  20.750000  19.853174   

Price                                                                ...  \
Ticker            AXP        BAC       BRK-B        COP         CRM  ...   
Date                                                                 ...   
2019-01-02  88.261612  21.812504  202.800003  52.238148  135.162781  ...   
2019-01-03  86.538803  21.462948  191.660004  51.249107  130.027466  ...   
2019-01-04  90.438271  22.354324  195.199997  52.543110  137.565887  ...   
2019-01-07  90.929169  22.336843  196.910004  52.312328  141.813721  ...   
2019-01-08  91.373772  22.293144  196.309998  53.012909  145.303726  ...   

Price         Volume                                                        \
Ticker          MSFT       NVDA      ORCL      OXY      PSX SBBG       SLB   
Date                                                                         
2019-01-02  35329300  508752000  14320400  5355300  3147800    8  15926100   
2019-01-03  42579100  705552000  19868700  5465500  3182000    0  20007900   
2019-01-04  44060600  585620000  20984000  6367400  3307200    0  19506600   
2019-01-07  35656100  709160000  17967900  5771600  3474000    0  15736900   
2019-01-08  31514400  786016000  16255700  5420100  2357300    0  13070200   

Price                                     
Ticker             V       WFC       XOM  
Date                                      
2019-01-02   8788000  20295200  16727200  
2019-01-03   9428300  22262000  13866100  
2019-01-04  11065800  23343600  16043600  
2019-01-07  12928000  21858000  10844200  
2019-01-08   9243000  19702900  11439000  

[5 rows x 180 columns]

InĀ [1040]:
key_stats_list = []

# Retrieve key statistics for each ticker
for ticker in all_tickers:
    try:
        stock = yf.Ticker(ticker)
        key_stats = stock.info
        key_stats['Ticker'] = ticker
        key_stats_list.append(key_stats)
    except Exception as e:
        print(f"Could not retrieve data for {ticker}: {e}")

key_stats_df = pd.DataFrame(key_stats_list)

# Perform forward fill to handle missing data
key_stats_df.ffill(inplace=True)

# Save the DataFrame to a CSV file
key_stats_df.to_csv("key_statistics.csv", index=False)
print(key_stats_df.head())
                    address1         city state         zip        country  \
0         One Apple Park Way    Cupertino    CA       95014  United States   
1          One Microsoft Way      Redmond    WA  98052-6399  United States   
2  2788 San Tomas Expressway  Santa Clara    CA       95051  United States   
3          3421 Hillview Ave    Palo Alto    CA       94304  United States   
4            2300 Oracle Way       Austin    TX       78741  United States   

            phone                    website                   industry  \
0    408 996 1010      https://www.apple.com       Consumer Electronics   
1    425 882 8080  https://www.microsoft.com  Software - Infrastructure   
2    408 486 2000     https://www.nvidia.com             Semiconductors   
3    650 427 6000   https://www.broadcom.com             Semiconductors   
4  (737) 867-1000     https://www.oracle.com  Software - Infrastructure   

               industryKey               industryDisp  ... revenueGrowth  \
0     consumer-electronics       Consumer Electronics  ...         0.049   
1  software-infrastructure  Software - Infrastructure  ...         0.152   
2           semiconductors             Semiconductors  ...         2.621   
3           semiconductors             Semiconductors  ...         0.164   
4  software-infrastructure  Software - Infrastructure  ...         0.033   

  grossMargins ebitdaMargins operatingMargins  financialCurrency  \
0      0.45962       0.34175          0.29556                USD   
1      0.69764       0.52804          0.43143                USD   
2      0.75286       0.61768          0.64925                USD   
3      0.74244       0.49957          0.31765                USD   
4      0.71407       0.40080          0.33261                USD   

  trailingPegRatio  Ticker  address2  fax  industrySymbol  
0           2.1001    AAPL       NaN  NaN             NaN  
1           2.1709    MSFT       NaN  NaN             NaN  
2           1.1924    NVDA       NaN  NaN             NaN  
3           1.0732    AVGO       NaN  NaN             NaN  
4           1.9084    ORCL       NaN  NaN             NaN  

[5 rows x 136 columns]
InĀ [292]:
# Select the feature columns from csv file
features = key_stats_df.columns[34:132]
data_features = key_stats_df[features]
X = data_features.values
print(data_features.head())
   dividendRate  dividendYield  exDividendDate  payoutRatio  \
0          1.00         0.0048    1.723421e+09       0.1476   
1          3.00         0.0076    1.723680e+09       0.2483   
2          0.04         0.0004    1.718064e+09       0.0094   
3          2.10         0.0148    1.719187e+09       0.8488   
4          1.60         0.0125    1.720656e+09       0.4313   

   fiveYearAvgDividendYield   beta  trailingPE  forwardPE     volume  \
0                      0.68  1.244   31.541855  27.890982   69353188   
1                      0.91  0.894   33.807953  26.272846   24826640   
2                      0.11  1.680   60.964912  27.874332  386027827   
3                      2.85  1.188   62.034485  23.788430   24343940   
4                      1.52  1.014   34.557953  17.831710    5556125   

   regularMarketVolume  ...  returnOnEquity  freeCashflow  operatingCashflow  \
0             69353188  ...         1.60583  8.615812e+10       1.130410e+11   
1             24826640  ...         0.37133  5.670525e+10       1.185480e+11   
2            386027827  ...         1.15658  2.902375e+10       4.052400e+10   
3             24343940  ...         0.22229  2.022288e+10       1.894200e+10   
4              5556125  ...         1.93923  1.043725e+10       1.867300e+10   

   earningsGrowth  revenueGrowth  grossMargins  ebitdaMargins  \
0           0.111          0.049       0.45962        0.34175   
1           0.097          0.152       0.69764        0.52804   
2           6.500          2.621       0.75286        0.61768   
3           1.881          0.164       0.74244        0.49957   
4          -0.063          0.033       0.71407        0.40080   

   operatingMargins  financialCurrency  trailingPegRatio  
0           0.29556                USD            2.0592  
1           0.43143                USD            2.1290  
2           0.64925                USD            1.1428  
3           0.31765                USD            1.0463  
4           0.33261                USD            1.8921  

[5 rows x 98 columns]
InĀ [396]:
# Convert all feature columns to numeric, forcing errors to NaN
data_features = data_features.apply(pd.to_numeric, errors='coerce')

# Impute missing values using SimpleImputer
imputer = SimpleImputer(strategy='mean')
data_features_imputed = imputer.fit_transform(data_features)

# Normalize the raw feature data
scaler = StandardScaler()
data_features_N = scaler.fit_transform(data_features_imputed)

# Extract feature tickers
features_tickers = data_features.columns.values

# Save the DataFrame to a CSV file
key_stats_df.to_csv("key_statistics.csv", index=False)

# Display the head of the normalized DataFrame
data_features_N_df = pd.DataFrame(data_features_N, columns=features_tickers)
print(data_features_N_df.head())
   dividendRate  dividendYield  exDividendDate  payoutRatio  \
0     -0.783784      -1.064245        0.309385    -0.870566   
1      0.107893      -0.915226        0.310678    -0.460452   
2     -1.211789      -1.298418        0.282672    -1.433404   
3     -0.293362      -0.532034        0.288273     1.985165   
4     -0.516281      -0.654443        0.295597     0.284840   

   fiveYearAvgDividendYield      beta  trailingPE  forwardPE    volume  \
0                 -0.945587 -0.159321    0.166085   1.702049  0.632785   
1                 -0.838574 -0.826867    0.247353   1.463111 -0.013827   
2                 -1.210791  0.672251    1.221269   1.699590  5.231514   
3                  0.064052 -0.266128    1.259626   1.096258 -0.020836   
4                 -0.554759 -0.597994    0.274250   0.216677 -0.293672   

   regularMarketVolume  ...  returnOnAssets  returnOnEquity  freeCashflow  \
0             0.632785  ...        1.402023        2.360884      3.365327   
1            -0.013827  ...        0.592729       -0.033993      1.947319   
2             5.231514  ...        4.147093        1.489358      0.614595   
3            -0.020836  ...       -0.171592       -0.323124      0.190877   
4            -0.293672  ...       -0.199571        3.007666     -0.280251   

   operatingCashflow  earningsGrowth  revenueGrowth  grossMargins  \
0           2.496853       -0.339317      -0.188408     -0.316824   
1           2.641829       -0.345631       0.021765      0.557220   
2           0.587788        2.541994       5.059798      0.759996   
3           0.019625        0.458917       0.046252      0.721732   
4           0.012544       -0.417788      -0.221056      0.617553   

   ebitdaMargins  operatingMargins  trailingPegRatio  
0       0.251472          0.156603         -0.283586  
1       1.126601          0.861625         -0.277820  
2       1.547699          1.991883         -0.359285  
3       0.992858          0.271226         -0.367256  
4       0.528870          0.348853         -0.297389  

[5 rows x 84 columns]

Summary statistics and graphical analysis¶

On analyzing summary of the stocks by using describe() and also graphically plot the performance of the stocks from 2019-01-01 till 2014-08, it's observed that:

  • The mean values for MSFT and NVDA are extremely high compared to other stocks. The standard deviations for these two stocks are also very large, indicating high volatility.
  • The minimum and 25th percentile values for SBBG are 0, which is highly unusual. The closing price plot is also unusual and could indicate periods where the stock was inactive, delisted, or experienced extreme volatility.
  • Boxplots for NVDA, AMD, AVGO, and PSX show several outliers which could be due to specific stock realated events or indicates that they are more volatile.
InĀ [296]:
stock_data.describe()
Out[296]:
Price Adj Close ... Volume
Ticker AAPL ACN ADBE AMD AVGO AXP BAC BRK-B COP CRM ... MSFT NVDA ORCL OXY PSX SBBG SLB V WFC XOM
count 1408.000000 1408.000000 1408.000000 1408.000000 1408.000000 1408.000000 1408.000000 1408.000000 1408.000000 1408.000000 ... 1.408000e+03 1.408000e+03 1.408000e+03 1.408000e+03 1.408000e+03 1407.000000 1.408000e+03 1.408000e+03 1.408000e+03 1.408000e+03
mean 128.650285 258.436890 435.037429 87.411349 55.126213 144.190631 31.021857 279.887670 73.694817 204.438077 ... 2.854695e+07 4.619667e+08 1.073515e+07 1.752329e+07 3.241489e+06 0.646766 1.225496e+07 7.802563e+06 2.566035e+07 2.103017e+07
std 49.456658 63.908956 115.324439 41.625963 34.470324 39.173737 6.391193 67.969994 31.728508 46.068845 ... 1.218376e+07 1.956134e+08 6.648476e+06 1.414406e+07 1.425770e+06 11.702495 5.757358e+06 3.639250e+06 1.402453e+07 1.039236e+07
min 34.029243 125.106400 215.699997 17.049999 14.659641 64.769859 16.262651 162.130005 19.237530 123.944916 ... 8.989200e+06 9.788400e+07 2.168200e+06 2.446300e+06 1.081600e+06 0.000000 3.077600e+06 1.640900e+06 4.635500e+06 3.979400e+06
25% 78.796825 196.848690 335.487503 54.027501 27.744335 111.542433 25.929496 213.924999 48.168697 161.033649 ... 2.089445e+07 3.258312e+08 6.747825e+06 8.528325e+06 2.366625e+06 0.000000 8.539125e+06 5.382500e+06 1.623218e+07 1.399320e+07
50% 140.728851 270.918060 444.309998 85.430000 45.524628 148.024498 30.492549 280.789993 64.220913 202.175781 ... 2.568230e+07 4.331810e+08 8.894400e+06 1.353000e+07 2.941850e+06 0.000000 1.100995e+07 6.943200e+06 2.112420e+07 1.838580e+07
75% 169.090652 307.366241 515.902527 110.285000 60.902423 165.524677 35.791312 323.797508 106.432148 241.392776 ... 3.269340e+07 5.596882e+08 1.242005e+07 2.167385e+07 3.680250e+06 0.000000 1.439680e+07 9.171350e+06 3.046582e+07 2.547872e+07
max 234.820007 398.653076 688.369995 211.380005 182.308105 253.039993 46.190937 445.609985 132.675964 315.974762 ... 9.701270e+07 2.511528e+09 6.860570e+07 1.346690e+08 1.657980e+07 300.000000 8.614800e+07 3.837960e+07 1.189526e+08 8.443940e+07

8 rows Ɨ 180 columns

InĀ [1052]:
fig, axs = plt.subplots(3, 1, figsize=(14, 15))

# Plot tech stocks
for ticker in tech_stocks:
    axs[0].plot(stock_data['Close'][ticker], label=ticker)
axs[0].set_title('Tech Stocks Closing Prices')
axs[0].set_ylabel('Price')
axs[0].legend()

# Plot finance stocks
for ticker in finance_stocks:
    axs[1].plot(stock_data['Close'][ticker], label=ticker)
axs[1].set_title('Finance Stocks Closing Prices')
axs[1].set_ylabel('Price')
axs[1].legend()

# Plot energy stocks
for ticker in energy_stocks:
    axs[2].plot(stock_data['Close'][ticker], label=ticker)
axs[2].set_title('Energy Stocks Closing Prices')
axs[2].set_ylabel('Price')
axs[2].legend()

plt.tight_layout()
plt.show()
No description has been provided for this image
InĀ [1054]:
fig, axs = plt.subplots(3, 1, figsize=(14, 15))

# Plot log-transformed tech stocks
for ticker in tech_stocks:
    axs[0].plot(np.log(stock_data['Close'][ticker]), label=ticker)
axs[0].set_title('Tech Stocks Closing Prices (Log Transformed)')
axs[0].set_ylabel('Log Price')
axs[0].legend()

# Plot log-transformed finance stocks
for ticker in finance_stocks:
    axs[1].plot(np.log(stock_data['Close'][ticker]), label=ticker)
axs[1].set_title('Finance Stocks Closing Prices (Log Transformed)')
axs[1].set_ylabel('Log Price')
axs[1].legend()

# Plot log-transformed energy stocks
for ticker in energy_stocks:
    axs[2].plot(np.log(stock_data['Close'][ticker]), label=ticker)
axs[2].set_title('Energy Stocks Closing Prices (Log Transformed)')
axs[2].set_ylabel('Log Price')
axs[2].legend()

plt.tight_layout()
plt.show()
No description has been provided for this image
InĀ [308]:
# Extract close data for different sets of stocks
close_data_tech = stock_data['Close'][tech_stocks]
close_data_finance = stock_data['Close'][finance_stocks]
close_data_energy = stock_data['Close'][energy_stocks]

plt.figure(figsize=(18, 12))

# Boxplot for tech stocks
plt.subplot(3, 1, 1)
close_data_tech.boxplot()
plt.title('Boxplot of Closing Prices of Technology Stocks')
plt.xlabel('Stocks')
plt.ylabel('Closing Price')
plt.xticks(rotation=90)
plt.grid(True)

# Boxplot for finance stocks
plt.subplot(3, 1, 2)
close_data_finance.boxplot()
plt.title('Boxplot of Closing Prices of Finance Stocks')
plt.xlabel('Stocks')
plt.ylabel('Closing Price')
plt.xticks(rotation=90)
plt.grid(True)

# Boxplot for energy stocks
plt.subplot(3, 1, 3)
close_data_energy.boxplot()
plt.title('Boxplot of Closing Prices of Energy Stocks')
plt.xlabel('Stocks')
plt.ylabel('Closing Price')
plt.xticks(rotation=90)
plt.grid(True)

plt.tight_layout()
plt.show()
No description has been provided for this image

Stock Returns¶

The mean returns for the stocks are calculated and shown in a descending order along with their standard deviations over a 15-day window for each of the three sectors.

InĀ [1080]:
def calculate_returns(stock_list, data):
    for stock in stock_list:
        stock_prices = data[stock].fillna(method='ffill')
        returns_dict[stock] = get_returns(stock_prices, 15)
        returns_means[stock] = np.mean(returns_dict[stock])
        returns_sds[stock] = np.std(returns_dict[stock])

# Calculate returns for tech, finance, and energy stocks
calculate_returns(tech_stocks, stock_data['Close'])
calculate_returns(finance_stocks, stock_data['Close'])
calculate_returns(energy_stocks, stock_data['Close'])

returns_stats_tech_df = pd.DataFrame({
    'Stock': tech_stocks,
    'Mean Return': [returns_means[stock] for stock in tech_stocks],
    'Standard Deviation': [returns_sds[stock] for stock in tech_stocks]
})

returns_stats_finance_df = pd.DataFrame({
    'Stock': finance_stocks,
    'Mean Return': [returns_means[stock] for stock in finance_stocks],
    'Standard Deviation': [returns_sds[stock] for stock in finance_stocks]
})

returns_stats_energy_df = pd.DataFrame({
    'Stock': energy_stocks,
    'Mean Return': [returns_means[stock] for stock in energy_stocks],
    'Standard Deviation': [returns_sds[stock] for stock in energy_stocks]
})

print("Tech Stocks Mean Returns and Standard Deviations")
print(returns_stats_tech_df.sort_values(by='Mean Return', ascending=False))

print("\nFinance Stocks Mean Returns and Standard Deviations")
print(returns_stats_finance_df.sort_values(by='Mean Return', ascending=False))

print("\nEnergy Stocks Mean Returns and Standard Deviations")
print(returns_stats_energy_df.sort_values(by='Mean Return', ascending=False))
Tech Stocks Mean Returns and Standard Deviations
  Stock  Mean Return  Standard Deviation
1  MSFT     3.392337           15.344852
6  ADBE     3.215391           37.699295
8   ACN     1.936867           17.861691
0  AAPL     1.930573            9.157680
3  AVGO     1.343918            5.853680
7   AMD     1.293742           11.887166
2  NVDA     1.151718            4.938479
5   CRM     1.138695           17.615138
4  ORCL     0.943376            5.736916
9  CSCO     0.034065            2.822704

Finance Stocks Mean Returns and Standard Deviations
   Stock  Mean Return  Standard Deviation
7     GS     3.267713           21.616423
3     MA     2.712939           19.810422
0  BRK-B     2.517914           12.804207
9    AXP     1.534251           10.958976
2      V     1.343326           10.741866
1    JPM     1.138452            8.864121
8     MS     0.620989            5.570682
4    BAC     0.143821            2.635675
6    WFC     0.094530            3.504981
5   SBBG   -67.741398          319.382442

Energy Stocks Mean Returns and Standard Deviations
  Stock  Mean Return  Standard Deviation
6   MPC     1.143548            8.425291
7   PSX     0.502072            8.590547
0   XOM     0.481183            5.441699
2   COP     0.472186            7.298642
1   CVX     0.440581            9.156263
3   EOG     0.308824            9.044866
4   SLB     0.061147            4.048081
5   EPD     0.024681            1.609299
9    ET     0.017864            0.968930
8   OXY    -0.058595            4.979771

Stocks with mean returns > 10%¶

The stocks with mean returns greater than 1.1 are filtered out as per my investment criteria

InĀ [1066]:
returns_stats_df = pd.DataFrame({
    'Stock': list(returns_means.keys()),
    'Mean Return': list(returns_means.values()),
    'Standard Deviation': list(returns_sds.values())
})

# Filter out stocks with mean return less than 1.10
filtered_stocks_df = returns_stats_df[returns_stats_df['Mean Return'] >= 1.10]

# Sort the filtered DataFrame by Mean Return in descending order
filtered_stocks_df = filtered_stocks_df.sort_values(by='Mean Return', ascending=False)
print(filtered_stocks_df)
    Stock  Mean Return  Standard Deviation
1    MSFT     3.392337           15.344852
17     GS     3.267713           21.616423
6    ADBE     3.215391           37.699295
13     MA     2.712939           19.810422
10  BRK-B     2.517914           12.804207
8     ACN     1.936867           17.861691
0    AAPL     1.930573            9.157680
19    AXP     1.534251           10.958976
3    AVGO     1.343918            5.853680
12      V     1.343326           10.741866
7     AMD     1.293742           11.887166
2    NVDA     1.151718            4.938479
26    MPC     1.143548            8.425291
5     CRM     1.138695           17.615138
11    JPM     1.138452            8.864121
InĀ [Ā ]:
 

The histograms of the filtered list of top-performing stocks are visualized after normalizing the returns using Z-score normalization. We observe a bell-shaped, symmetric histogram centered around zero for all 15 stocks indicating that their returns are normally distributed. Stocks that have a wider spread suggests higher volatility.

InĀ [1068]:
filtered_stocks = filtered_stocks_df['Stock'].tolist()

# Filter the original stock data to include only the filtered stocks
filtered_close_data = stock_data['Close'][filtered_stocks]

# Calculate returns for filtered stocks
filtered_returns = {stock: get_returns(filtered_close_data[stock].fillna(method='ffill'), 15) for stock in filtered_stocks}

# Apply Z-score normalization to the returns
scaler = StandardScaler()
normalized_returns = {stock: scaler.fit_transform(np.array(filtered_returns[stock]).reshape(-1, 1)).reshape(-1) for stock in filtered_stocks}

# Plot normalized returns in a 5x4 grid
fig, axes = plt.subplots(5, 4, figsize=(20, 20))
axes = axes.flatten()

for i, stock in enumerate(filtered_stocks):
    axes[i].hist(normalized_returns[stock], bins=50, alpha=0.75, color='blue')
    axes[i].set_title(f'Normalized Returns for {stock}')
    axes[i].set_xlabel('Normalized Return')
    axes[i].set_ylabel('Frequency')

# Hide any unused subplots
for ax in axes[len(filtered_stocks):]:
    ax.set_visible(False)

plt.tight_layout()
plt.show()
No description has been provided for this image

Correlation analysis shows that all stocks are positively correlated with values tanging to 0.28 to 0.99. Most stocks have a moderate to high correlation except ADBE and MPC which has a correlation of 0.28.

InĀ [1071]:
# Plot the correlation matrix for the filtered stocks
correlation_matrix = filtered_close_data.corr()
plt.figure(figsize=(16, 12))
sns.heatmap(correlation_matrix, annot=True, fmt=".2f", cmap="coolwarm", vmin=-1, vmax=1)
plt.title('Correlation Matrix of Stocks with Mean Return > 1.1')
plt.show()
No description has been provided for this image

Autocorrelation scatter plot for all filtered stocks indicates a strong autocorrelation, suggesting that each stock’s price on one day is strongly related to its prices on the previous days. All filtered stocks seem to be trending and momentum strategy could be effective.

InĀ [980]:
# Autocorrelation scatter plot analysis for the filtered stocks
fig, axes = plt.subplots(5, 4, figsize=(20, 20))
axes = axes.flatten()

for i, stock in enumerate(filtered_stocks):
    lag_plot(filtered_close_data[stock].dropna(), lag=1, ax=axes[i])
    axes[i].set_title(f'Autocorrelation Scatter Plot for {stock}')

# Hide any unused subplots
for ax in axes[len(filtered_stocks):]:
    ax.set_visible(False)

plt.tight_layout()
plt.show()
No description has been provided for this image

All stocks show a good probability of increasing and providing a likelihood of achieving a return > 20 over a 50 day period. Only AVGO and NVDA does not meet my investment criteria of mean return > 10% during the 50 day period. However, I would like to keep both AVGO and NVDA in the basket for analysis.

InĀ [982]:
# Initialize lists to store mean and standard deviation of returns over different investment periods
all_returns_means = {stock: [] for stock in filtered_stocks}
all_returns_sds = {stock: [] for stock in filtered_stocks}

# Calculate mean and standard deviation of returns for 50 day period
for stock in filtered_stocks:
    for invest_time in range(1, 50):
        returns = get_returns(filtered_close_data[stock].fillna(method='ffill'), days_out=invest_time)
        all_returns_means[stock].append(np.mean(returns))
        all_returns_sds[stock].append(np.std(returns))

# Plot probabilities for exceeding a return of 20 for each stock over different investment periods
fig, axes = plt.subplots(5, 4, figsize=(20, 20))
axes = axes.flatten()

for i, stock in enumerate(filtered_stocks):
    prob_list = []
    for j in range(len(all_returns_means[stock])):
        prob = 1 - norm.cdf(20, all_returns_means[stock][j], all_returns_sds[stock][j])
        prob_list.append(prob)
    axes[i].plot(prob_list)
    axes[i].set_title(f'Prob of exceeding return of 20 for {stock}')
    axes[i].set_xlabel('Investment Period (Days)')
    axes[i].set_ylabel('Probability')

for ax in axes[len(filtered_stocks):]:
    ax.set_visible(False)

plt.tight_layout()
plt.show()
No description has been provided for this image
InĀ [984]:
# Calculate probabilities for exceeding a return of 20
probabilities = {stock: [] for stock in filtered_stocks}
for stock in filtered_stocks:
    for j in range(len(all_returns_means[stock])):
        prob = 1 - norm.cdf(20, all_returns_means[stock][j], all_returns_sds[stock][j])
        probabilities[stock].append(prob)

# Calculate the maximum probability for each stock
max_probabilities = {stock: max(probabilities[stock]) for stock in filtered_stocks}

# Sort stocks by the highest probabilities
sorted_stocks = sorted(max_probabilities.items(), key=lambda x: x[1], reverse=True)

# Print the sorted stocks
print("Filtered Stocks Sorted by Highest Probability of Exceeding a Return of 20:")
for stock, prob in sorted_stocks:
    print(f"{stock}: {prob:.4f}")
Filtered Stocks Sorted by Highest Probability of Exceeding a Return of 20:
ADBE: 0.4387
GS: 0.3883
MSFT: 0.3690
MA: 0.3511
ACN: 0.3150
CRM: 0.2961
BRK-B: 0.2902
AMD: 0.2236
AXP: 0.2052
AAPL: 0.1961
V: 0.1690
JPM: 0.1473
MPC: 0.1450
AVGO: 0.0533
NVDA: 0.0451

Stationarity Testing - ADF Test¶

All the filtered stocks have highly negative ADF statistics, with values ranging from around -5.91 to -7.25 which suggests strong evidence of Stationarity. It's observed that the returns are not random walks and all stocks maintain consistent statistical properties over time. The consistency of these results across different sectors suggests that market dynamics influencing these stocks contribute to a stable return profile.

InĀ [986]:
# Function to perform ADF test
def adf_test(timeseries):
    result = adfuller(timeseries)
    print('ADF Statistic: %f' % result[0])
    print('p-value: %f' % result[1])
    
# Perform ADF test and plot the results in grids of 3
fig, axes = plt.subplots(len(filtered_stocks), 3, figsize=(15, 5 * len(filtered_stocks)))
axes = axes.flatten()

for i, stock in enumerate(filtered_stocks):
    rw_returns = filtered_returns[stock]

    print(f'ADF Test for {stock} returns:')
    adf_test(rw_returns)

    axes[i*3].plot(rw_returns, label=f'{stock} Returns')
    axes[i*3].set_title(f'{stock} Returns')
    axes[i*3].legend()

    # Random walk simulation
    seed(1)
    random_walk = []
    random_walk.append(-1 if random() < 0.5 else 1)
    for j in range(1, 1000):
        movement = -1 if random() < 0.5 else 1
        value = random_walk[j-1] + movement
        random_walk.append(value)

    axes[i*3+1].plot(random_walk, label='Random Walk')
    axes[i*3+1].set_title('Random Walk')
    axes[i*3+1].legend()

    pd_plotting.autocorrelation_plot(random_walk, ax=axes[i*3+2])
    axes[i*3+2].set_title('Autocorrelation of Random Walk')

for ax in axes[len(filtered_stocks)*3:]:
    ax.set_visible(False)

plt.tight_layout()
plt.show()
ADF Test for MSFT returns:
ADF Statistic: -7.000522
p-value: 0.000000
ADF Test for GS returns:
ADF Statistic: -7.252001
p-value: 0.000000
ADF Test for ADBE returns:
ADF Statistic: -6.407761
p-value: 0.000000
ADF Test for MA returns:
ADF Statistic: -7.021395
p-value: 0.000000
ADF Test for BRK-B returns:
ADF Statistic: -6.829493
p-value: 0.000000
ADF Test for ACN returns:
ADF Statistic: -6.898409
p-value: 0.000000
ADF Test for AAPL returns:
ADF Statistic: -6.735005
p-value: 0.000000
ADF Test for AXP returns:
ADF Statistic: -6.066305
p-value: 0.000000
ADF Test for AVGO returns:
ADF Statistic: -6.234501
p-value: 0.000000
ADF Test for V returns:
ADF Statistic: -6.849880
p-value: 0.000000
ADF Test for AMD returns:
ADF Statistic: -6.052814
p-value: 0.000000
ADF Test for NVDA returns:
ADF Statistic: -7.182806
p-value: 0.000000
ADF Test for MPC returns:
ADF Statistic: -6.274607
p-value: 0.000000
ADF Test for CRM returns:
ADF Statistic: -5.912784
p-value: 0.000000
ADF Test for JPM returns:
ADF Statistic: -6.096113
p-value: 0.000000
No description has been provided for this image

Feature Engineering¶

Most stocks have moderately scaled features, with values closer to zero, indicating that their financial and technical indicators are within a normal range, relative to their historical data. AMD and AVGO show high positive values for f1 (close-to-open return) and NVDA has the highest value for f10 (Z-score of closing prices), indicating that its closing price is significantly above its mean.

The hierarchical clustering dendrogram below provides a visual representation of the relationships between features based on their correlations. It's observed that f3 and f7 as well as f4 and f5 provide similar information and we can reducing the feature set by selecting one feature from both groups to avoid multicollinearity.

InĀ [988]:
# Filter the original tech_stocks DataFrame to include only the filtered stocks
filtered_stock_data = stock_data.loc[:, (slice(None), filtered_stocks)]

# Feature Engineering
data_transform = filtered_stock_data.stack(level=1)
features = pd.DataFrame(index=data_transform.index).sort_index()

# Calculate features
features['f1'] = data_transform['Close'] / data_transform['Open'] - 1
features['f2'] = data_transform['Open'] / data_transform.groupby(level=1)['Close'].shift(1) - 1
features['f3'] = data_transform['Volume'].apply(np.log)
features['f4'] = data_transform.groupby(level=1)['Volume'].diff()

diff_50 = lambda x: x.diff(50)
features['f5'] = data_transform.groupby(level=1)['Volume'].apply(diff_50).sort_index(level='Date', axis=0).droplevel(0)

pct_chg_fxn = lambda x: x.pct_change()
features['f6'] = data_transform.groupby(level=1)['Volume'].apply(pct_chg_fxn).sort_index(level='Date', axis=0).droplevel(0)

ma_5 = lambda x: x.rolling(5).mean()
features['f7'] = data_transform.groupby(level=1)['Volume'].apply(ma_5).apply(np.log).sort_index(level='Date', axis=0).droplevel(0)

ma_200 = lambda x: x.rolling(200).mean() - 1
features['f8'] = data_transform['Volume'] / data_transform.groupby(level=1)['Volume'].apply(ma_200).sort_index(level='Date', axis=0).droplevel(0)

ema_50 = lambda x: x.ewm(span=50).mean() - 1
features['f9'] = data_transform['Close'] / data_transform.groupby(level=1)['Close'].apply(ema_50).sort_index(level='Date', axis=0).droplevel(0)

zscore_fxn = lambda x: (x - x.mean()) / x.std()
features['f10'] = data_transform.groupby(level=1)['Close'].apply(zscore_fxn).sort_index(level='Date', axis=0).droplevel(0)

features['f10'].unstack().plot.kde(title='Z-Scores')
plt.show()
No description has been provided for this image
InĀ [1086]:
print(features.tail())
                         f1        f2         f3          f4           f5  \
Date       Ticker                                                           
2024-08-08 MA      0.007385  0.005335  14.276261   -474537.0   -1085537.0   
           MPC     0.024412  0.003838  14.431803     98648.0    -528952.0   
           MSFT    0.000721  0.009964  16.770303  -1453158.0    3479742.0   
           NVDA    0.029219  0.031139  19.773302 -24685314.0 -265972914.0   
           V       0.006586  0.006276  15.464718  -7705744.0   -1009244.0   

                         f6         f7        f8        f9       f10  
Date       Ticker                                                     
2024-08-08 MA     -0.230380  14.894516  0.623566  1.018084  1.817965  
           MPC     0.056261  14.637521  0.674100  1.017785  1.894379  
           MSFT   -0.070368  17.110919  0.851567  0.940253  1.652688  
           NVDA   -0.059997  19.921231  0.869428  0.934388  2.913131  
           V      -0.596947  16.237610  0.766653  0.978246  1.431532  
InĀ [1089]:
# Apply StandardScaler to the features
std_scaler = StandardScaler()
features_scaled = std_scaler.fit_transform(features.dropna())
df_scaled = pd.DataFrame(features_scaled, index=features.dropna().index, columns=features.dropna().columns)

# Print the last few rows of the scaled DataFrame
print(df_scaled.tail(15))
                         f1        f2        f3        f4        f5        f6  \
Date       Ticker                                                               
2024-08-08 AAPL    0.027515  1.021149  0.929630 -0.468616 -0.110455 -0.503036   
           ACN     0.073314 -0.620449 -1.140396 -0.013236 -0.024025 -0.418452   
           ADBE    0.993924  0.882670 -1.027206  0.011624 -0.005302  0.291079   
           AMD     1.776348  1.719028  0.985938 -0.149258 -0.256779 -0.241245   
           AVGO    1.764844  2.393767  0.602534 -0.073377  0.056833 -0.226805   
           AXP     0.566081  0.628802 -1.286510 -0.030500 -0.011696 -0.741386   
           BRK-B   0.267326  0.313882 -0.909604 -0.054040 -0.009849 -0.733971   
           CRM     1.304728  0.866415 -0.698135 -0.012003 -0.081870 -0.260953   
           GS      0.883381  0.658743 -1.198521 -0.023453 -0.003147 -0.607601   
           JPM     0.404871  0.662465 -0.394793 -0.087061 -0.016665 -0.622663   
           MA      0.379093  0.323149 -1.212482 -0.011269 -0.014335 -0.407764   
           MPC     1.317857  0.221435 -1.112593  0.002897 -0.006299 -0.008641   
           MSFT    0.011623  0.637648  0.389193 -0.035456  0.051583 -0.184961   
           NVDA    1.582889  2.076351  2.317720 -0.609655 -3.839028 -0.170521   
           V       0.335001  0.387094 -0.449254 -0.189993 -0.013234 -0.918177   

                         f7        f8        f9       f10  
Date       Ticker                                          
2024-08-08 AAPL    1.308348 -0.550694 -0.281507  1.602402  
           ACN    -0.964237 -0.683457 -0.323049  0.607716  
           ADBE   -0.976996 -0.707346 -0.335857  0.674437  
           AMD     1.163429 -0.451041 -1.686687  1.049550  
           AVGO    0.692396 -0.373184 -0.843167  2.602059  
           AXP    -0.678226 -1.074358 -0.530098  2.187578  
           BRK-B  -0.516889 -0.540937 -0.091369  2.178171  
           CRM    -0.458867 -0.908457 -0.565191  0.814590  
           GS     -0.762472 -0.598449  0.001365  2.146716  
           JPM     0.040986 -0.769796 -0.304681  2.333219  
           MA     -0.845464 -0.757714 -0.178871  1.743342  
           MPC    -1.012686 -0.656671 -0.182449  1.825575  
           MSFT    0.596714 -0.301824 -1.110599  1.565478  
           NVDA    2.425338 -0.266110 -1.180803  2.921908  
           V       0.028465 -0.471610 -0.655780  1.327480  
InĀ [367]:
# Calculate the correlation matrix
corr_matrix = df_scaled.corr()

correlations_array = np.asarray(corr_matrix)

# Perform hierarchical clustering
linkage = hierarchy.linkage(distance.pdist(correlations_array), method='average')

# Plot the cluster map
g = sns.clustermap(corr_matrix, row_linkage=linkage, col_linkage=linkage,
                   row_cluster=True, col_cluster=True, figsize=(10, 10), cmap='Blues')
plt.setp(g.ax_heatmap.yaxis.get_majorticklabels(), rotation=0)
plt.show()
label_order = corr_matrix.iloc[:, g.dendrogram_row.reordered_ind].columns
No description has been provided for this image
InĀ [371]:
# Generate the hierarchical clustering dendrogram
Z = linkage(corr_matrix, 'average')
plt.figure(figsize=(25, 10))
labelsize = 20
ticksize = 15
plt.title('Hierarchical Clustering Dendrogram for Feature Selection', fontsize=labelsize)
plt.xlabel('Features', fontsize=labelsize)
plt.ylabel('Distance', fontsize=labelsize)
dendrogram(
    Z,
    leaf_rotation=90.,  # rotates the x axis labels
    leaf_font_size=8.,  # font size for the x axis labels
    labels=corr_matrix.columns
)
pylab.yticks(fontsize=ticksize)
pylab.xticks(rotation=-90, fontsize=ticksize)
plt.savefig('dendrogram_Feature_Selection.png')
plt.show()
No description has been provided for this image

Lasso Regression¶

  • The Lasso regression for each stock identified f10 (the Z-score) as the dominant feature for predicting closing prices.
  • All other features (f1 to f9) were deemed insignificant and thus received coefficients equal to zero.
InĀ [1101]:
rows = len(filtered_stocks) // 5 + (1 if len(filtered_stocks) % 5 != 0 else 0)
fig, axes = plt.subplots(rows, 5, figsize=(20, 5 * rows))
axes = axes.flatten()

for i, stock in enumerate(filtered_stocks):
    stock_features = df_scaled.xs(stock, level='Ticker').iloc[0:-1]
    close_prices = filtered_stock_data['Close'][stock][filtered_stock_data['Close'][stock].index >= '2020-10-16']

    # Ensure the indices align for the Lasso regression
    common_index = stock_features.index.intersection(close_prices.index)
    stock_features = stock_features.loc[common_index]
    close_prices = close_prices.loc[common_index]

    # Fit the Lasso model
    lasso = Lasso()
    lasso.fit(stock_features, close_prices)
    
    # Print the Lasso coefficients
    print(f'Lasso Coefficients for {stock}:')
    print(lasso.coef_)

    # Scatter plot of the predictions vs actual close prices
    axes[i].scatter(lasso.predict(stock_features.iloc[-100:]), close_prices.iloc[-100:])
    axes[i].set_xlabel('Predicted Prices')
    axes[i].set_ylabel('Actual Prices')
    axes[i].set_title(f'Lasso Predictions vs Actual Prices for {stock}')

# Hide any unused subplots
for ax in axes[len(filtered_stocks):]:
    ax.set_visible(False)

plt.tight_layout()
plt.show()
Lasso Coefficients for MSFT:
[ 0.          0.         -0.         -0.         -0.          0.
 -0.         -0.          0.         80.89595679]
Lasso Coefficients for GS:
[ 0.          0.         -0.          0.          0.         -0.
 -0.          0.          0.         75.04839359]
Lasso Coefficients for ADBE:
[  0.           0.          -0.          -0.          -0.
  -0.          -0.          -0.           0.         105.93142107]
Lasso Coefficients for MA:
[ 0.          0.         -0.         -0.         -0.         -0.
 -0.          0.          0.         55.66812595]
Lasso Coefficients for BRK-B:
[ 0.         -0.         -0.         -0.          0.         -0.
 -0.          0.          0.         61.88946648]
Lasso Coefficients for ACN:
[ 0.          0.          0.          0.          0.          0.
  0.          0.          0.         55.31763005]
Lasso Coefficients for AAPL:
[ 0.          0.         -0.         -0.          0.          0.
 -0.          0.          0.         42.30230772]
Lasso Coefficients for AXP:
[ 0.        -0.        -0.        -0.        -0.        -0.
 -0.         0.         0.        33.8091874]
Lasso Coefficients for AVGO:
[-0.          0.          0.         -0.          0.          0.
  0.          0.          0.         30.24816648]
Lasso Coefficients for V:
[ 0.          0.         -0.         -0.         -0.         -0.
 -0.          0.          0.         29.35582917]
Lasso Coefficients for AMD:
[ 0.          0.          0.          0.          0.          0.
  0.          0.          0.         37.23681318]
Lasso Coefficients for NVDA:
[-0.          0.          0.         -0.         -0.         -0.
  0.         -0.          0.         24.46192753]
Lasso Coefficients for MPC:
[-0.         -0.         -0.         -0.         -0.         -0.
 -0.         -0.         -0.         41.97298818]
Lasso Coefficients for CRM:
[-0.         0.        -0.        -0.        -0.         0.
 -0.        -0.         0.        41.9195327]
Lasso Coefficients for JPM:
[ 0.         -0.         -0.          0.          0.         -0.
 -0.          0.          0.         25.47116204]
No description has been provided for this image

Principal Component Analysis¶

To further explore the effect of different features, PCA is done on all the features for the stocks downloaded from yahoo finance during the data preprocessing and stored in key_statistics dataset. It's observed that 15 principal components explain 95.00% of variance. Moreover the eigen-portfolio weights were plotted to visually represent the effect of different features.

InĀ [1122]:
# Perform PCA
pca = PCA(random_state=84)
pca.fit(data_features_N)
Out[1122]:
PCA(random_state=84)
In a Jupyter environment, please rerun this cell to show the HTML representation or trust the notebook.
On GitHub, the HTML representation is unable to render, please try loading this page with nbviewer.org.
PCA(random_state=84)
InĀ [1124]:
# Function to plot PCA explained variance
def plotPCA(plot=False):
    var_threshold = 0.95
    var_explained = np.cumsum(pca.explained_variance_ratio_)
    num_comp = np.where(np.logical_not(var_explained < var_threshold))[0][0] + 1
    if plot:
        print('%d principal components explain %.2f%% of variance' % (num_comp, 100 * var_threshold))

        # PCA percent variance explained.
        bar_width = 0.9
        n_asset = var_explained.shape[0]
        x_indx = np.arange(n_asset)
        fig, ax = plt.subplots()

        # Eigenvalues measured as percentage of explained variance.
        rects = ax.bar(x_indx, pca.explained_variance_ratio_[:n_asset], bar_width)
        ax.set_xticks(x_indx + bar_width / 2)
        ax.set_xticklabels(list(range(n_asset)), rotation=45)
        ax.set_title('Percent variance explained')
        ax.set_ylabel('Explained Variance')
        ax.set_xlabel('Principal Components')
        plt.show()

plotPCA(plot=True)
15 principal components explain 95.00% of variance
No description has been provided for this image
InĀ [1126]:
pcs = pca.components_
pcs.shape
Out[1126]:
(30, 84)
InĀ [1128]:
# Plot cumulative explained variance
plt.figure()
plt.plot(np.cumsum(pca.explained_variance_ratio_))
plt.xlabel('Number of Components')
plt.ylabel('Variance (%)')  # for each component
plt.title(' Explained Variance of Features Dataset for filtered stocks with Mean Return > 1.10 ')
plt.show()
No description has been provided for this image
InĀ [1159]:
def PCWeights():
    weights = pd.DataFrame()

    for i in range(len(pcs)):
        weights["weights_{}".format(i)] = pcs[i, :] / sum(pcs[i, :])

    weights = weights.values.T
    return weights

weights = PCWeights()
portfolio = portfolio = pd.DataFrame()

def plotEigen(weights, plot=False, portfolio=portfolio):
    data_feature_N ={'weights': weights.squeeze()}
    #portfolio = pd.DataFrame(data_feature_N ={'weights': weights.squeeze()*100}, index = features_tickers)
    portfolio = pd.DataFrame(data_feature_N, index = features_tickers)
    portfolio.sort_values(by=['weights'], ascending=False, inplace=True)

    if plot:
        print('Sum of weights of current eigen-portfolio: %.2f' % np.sum(portfolio))
        ax = portfolio.plot(title='Current Eigen-Portfolio Weights',
            figsize=(20,8),
            xticks=range(0, len(features_tickers),1),
            rot=80,
            linewidth=5
            )

    plt.tick_params(labelsize = 10)
    plt.xlabel('Features', fontsize=10)
    plt.grid(True)
    plt.title('Eigen-Portfolio Features Weights ',fontdict={'fontsize':10})
    legend = ax.legend(loc=0, ncol=1, bbox_to_anchor=(0, 0, 1,1),fancybox=True,shadow=False,title='Feature',prop={'size':10})
    plt.setp(legend.get_title(),fontsize='10')
    plt.show()

    return portfolio
InĀ [1161]:
# Eigen-Portfolio Features Weights for PC0
plotEigen(weights=weights[0], plot=True)
Sum of weights of current eigen-portfolio: 1.00
No description has been provided for this image
Out[1161]:
weights
priceToSalesTrailing12Months 0.041007
enterpriseToRevenue 0.038891
forwardPE 0.037596
returnOnAssets 0.037534
marketCap 0.036009
... ...
bidSize -0.026761
askSize -0.027178
trailingAnnualDividendYield -0.030380
fiveYearAvgDividendYield -0.030535
dividendYield -0.030988

84 rows Ɨ 1 columns

InĀ [1163]:
# Eigen-Portfolio Features Weights for PC1
plotEigen(weights=weights[1], plot=True)
Sum of weights of current eigen-portfolio: 1.00
No description has been provided for this image
Out[1163]:
weights
targetLowPrice 0.189828
fiftyTwoWeekLow 0.188580
bid 0.186828
twoHundredDayAverage 0.186676
currentPrice 0.186587
... ...
averageVolume -0.120394
averageDailyVolume10Day -0.122629
averageVolume10days -0.122629
sharesShortPriorMonth -0.133092
sharesShort -0.133342

84 rows Ɨ 1 columns

InĀ [1165]:
# Create DataFrame of principal component weights
listpca = []
for k in range(len(pcs)):
    temp = 'PC{}'.format(k)
    listpca.append(temp)

pcs_df = pd.DataFrame(pcs, columns=data_features_N_df.columns, index=listpca).T
print(pcs_df)
                               PC0       PC1       PC2       PC3       PC4  \
dividendRate             -0.048379 -0.130982 -0.020105 -0.021320  0.197945   
dividendYield            -0.149468  0.061498 -0.140336  0.038130  0.105972   
exDividendDate           -0.027124  0.014076 -0.245937 -0.026177 -0.018955   
payoutRatio              -0.117763  0.039783 -0.145802  0.010546  0.024336   
fiveYearAvgDividendYield -0.147285  0.076689 -0.156519 -0.001907  0.068507   
...                            ...       ...       ...       ...       ...   
revenueGrowth             0.155595  0.093477 -0.060585 -0.071897  0.216512   
grossMargins              0.094378 -0.041329  0.128437 -0.097652 -0.075300   
ebitdaMargins             0.120918  0.006686 -0.052787 -0.117143 -0.151845   
operatingMargins          0.140785 -0.063509 -0.108683 -0.108845  0.044952   
trailingPegRatio         -0.030288  0.004814 -0.015103 -0.032012  0.045075   

                               PC5       PC6       PC7       PC8       PC9  \
dividendRate             -0.157050 -0.149759 -0.140911 -0.236856  0.024036   
dividendYield            -0.002725  0.150093  0.018042 -0.097587 -0.119056   
exDividendDate           -0.224770  0.002458 -0.164308  0.104952  0.055360   
payoutRatio              -0.031680  0.202601  0.075190 -0.088681 -0.129055   
fiveYearAvgDividendYield  0.035719  0.161989 -0.037889 -0.135477 -0.059661   
...                            ...       ...       ...       ...       ...   
revenueGrowth            -0.047463  0.047818 -0.045491 -0.013412  0.017357   
grossMargins             -0.222813 -0.050302  0.099113  0.164329  0.003796   
ebitdaMargins            -0.033804  0.048992 -0.068854  0.246931 -0.101479   
operatingMargins         -0.163169  0.002744  0.066693  0.227730 -0.103223   
trailingPegRatio         -0.014909 -0.114944 -0.167734  0.280771 -0.100171   

                          ...      PC20      PC21      PC22      PC23  \
dividendRate              ... -0.082749 -0.097275  0.124855 -0.102519   
dividendYield             ... -0.112359  0.119084  0.050798  0.024250   
exDividendDate            ...  0.108800  0.064336 -0.040363  0.008877   
payoutRatio               ...  0.145035 -0.003125 -0.299832 -0.013908   
fiveYearAvgDividendYield  ... -0.044933  0.009756  0.135016  0.366561   
...                       ...       ...       ...       ...       ...   
revenueGrowth             ...  0.031829  0.031749 -0.008011  0.156973   
grossMargins              ... -0.164450  0.022109 -0.217426  0.088494   
ebitdaMargins             ...  0.067797 -0.297834  0.133059 -0.070754   
operatingMargins          ... -0.072586  0.079647 -0.014682  0.110551   
trailingPegRatio          ...  0.249647 -0.112351  0.014765  0.034876   

                              PC24      PC25      PC26      PC27      PC28  \
dividendRate              0.092744 -0.112563 -0.157720 -0.003112  0.109221   
dividendYield            -0.071480 -0.085306  0.067945  0.025914  0.057856   
exDividendDate           -0.021438 -0.133997  0.003257 -0.065362 -0.050976   
payoutRatio              -0.072947 -0.101622 -0.027105 -0.130891  0.017275   
fiveYearAvgDividendYield -0.020263 -0.083643  0.160487 -0.226531  0.033801   
...                            ...       ...       ...       ...       ...   
revenueGrowth             0.026528 -0.082986 -0.157205  0.114624  0.216012   
grossMargins              0.174646 -0.034326  0.190523  0.057691 -0.185854   
ebitdaMargins             0.053702 -0.070199  0.042529  0.142212 -0.078967   
operatingMargins          0.129262  0.004023 -0.137815  0.048256 -0.009850   
trailingPegRatio         -0.081021 -0.115694 -0.081133 -0.167371  0.037547   

                              PC29  
dividendRate             -0.095440  
dividendYield            -0.429226  
exDividendDate            0.114266  
payoutRatio               0.055607  
fiveYearAvgDividendYield  0.017510  
...                            ...  
revenueGrowth            -0.043651  
grossMargins             -0.078146  
ebitdaMargins             0.047463  
operatingMargins          0.036036  
trailingPegRatio          0.013398  

[84 rows x 30 columns]

Sharpe Ratio¶

This below code finds the most optimized portfolio, in terms of risk-adjusted return (Sharpe Ratio), from a set of 30 portfolios that are constructed using the principal components of the key_statistics dataset for filtered stocks

Eigen Portfolio #1 was identified as having the highest Sharpe Ratio, but both the Return and Sharpe Ratio are displayed as NaN, indicating that there was an issue in the key_statistics dataset.

NOTE: Moving forward the key_statistics dataset will not be used for modelling analysis due to data discrepancy and undefined values.

InĀ [1000]:
# Sharpe Ratio
def sharpe_ratio(ts_returns, periods_per_year=252):
    n_years = ts_returns.shape[0] / periods_per_year
    annualized_return = np.power(np.prod(1 + ts_returns), (1 / n_years)) - 1
    annualized_vol = ts_returns.std() * np.sqrt(periods_per_year)
    annualized_sharpe = annualized_return / annualized_vol

    return annualized_return, annualized_vol, annualized_sharpe

# Optimized Portfolio
def optimizedPortfolio():
    n_portfolios = len(pcs)
    annualized_ret = np.array([0.] * n_portfolios)
    sharpe_metric = np.array([0.] * n_portfolios)
    annualized_vol = np.array([0.] * n_portfolios)
    highest_sharpe = 0
    highest_sharpe_weights = None

    for i in range(n_portfolios):
        pc_w = pcs[i, :] / sum(pcs[i, :])
        data_feature_N = {'weights': pc_w.squeeze()}
        eigen_prtfi = pd.DataFrame(data_feature_N, index=features_tickers)
        eigen_prtfi.sort_values(by=['weights'], ascending=False, inplace=True)

        eigen_prti_returns = np.dot(data_features_N, eigen_prtfi['weights'])
        eigen_prti_returns = pd.Series(eigen_prti_returns, index=data_features_N_df.index)
        er, vol, sharpe = sharpe_ratio(eigen_prti_returns)
        annualized_ret[i] = er
        annualized_vol[i] = vol
        sharpe_metric[i] = sharpe

        if sharpe > highest_sharpe:
            highest_sharpe = sharpe
            highest_sharpe_weights = eigen_prtfi

    highest_sharpe_index = np.argmax(sharpe_metric)

    print(f'Eigen portfolio #{highest_sharpe_index} with the highest Sharpe. Return {annualized_ret[highest_sharpe_index]*100:.2f}%, vol = {annualized_vol[highest_sharpe_index]*100:.2f}%, Sharpe = {sharpe_metric[highest_sharpe_index]:.2f}')

    print('Top features in the highest Sharpe portfolio:')
    print(highest_sharpe_weights.head(10))

    fig, ax = plt.subplots()
    fig.set_size_inches(12, 4)
    ax.plot(sharpe_metric, linewidth=3)
    ax.set_title('Sharpe ratio of eigen-portfolios')
    ax.set_ylabel('Sharpe ratio')
    ax.set_xlabel('Portfolios')

    results = pd.DataFrame(data={'Return': annualized_ret, 'Vol': annualized_vol, 'Sharpe': sharpe_metric})
    results.dropna(inplace=True)
    results.sort_values(by=['Sharpe'], ascending=False, inplace=True)
    print(results.head(10))

    plt.show()

optimizedPortfolio()
Eigen portfolio #1 with the highest Sharpe. Return nan%, vol = 1551.30%, Sharpe = nan
Top features in the highest Sharpe portfolio:
                               weights
fiveYearAvgDividendYield      2.379390
debtToEquity                  2.000594
totalRevenue                  1.714917
ebitda                        1.612266
enterpriseToRevenue           1.084224
numberOfAnalystOpinions       1.053090
revenueGrowth                 1.018931
recommendationMean            0.978078
priceToSalesTrailing12Months  0.974780
shortPercentOfFloat           0.723518
           Return         Vol         Sharpe
23  6.830871e+182  141.240288  4.836348e+180
22   2.633811e+97   61.617070   4.274483e+95
29   1.488836e+95   63.171098   2.356831e+93
28   5.455202e+87   63.461357   8.596101e+85
19   5.586205e+65   54.850169   1.018448e+64
20   6.415324e+52   44.654393   1.436661e+51
18   7.357409e+49   48.284172   1.523773e+48
7    2.186805e+13   28.714243   7.615750e+11
27  -1.000000e+00   25.984649  -3.848426e-02
12  -1.000000e+00   23.306952  -4.290565e-02
No description has been provided for this image

Modeling¶

Simple Linear Regression¶

We choose GS (Goldman Sachs) and ADBE (Adobe) for simple linear regression. The SLR R-square of 0.4333 indicates that 43.33% of the variance in the closing prices of GS can be explained by the closing prices of ADBE using the linear regression model.

InĀ [1003]:
# Tickers
stock_x_ticker = 'ADBE'
stock_y_ticker = 'GS'

# Get the closing prices
x_data = stock_data['Close'][stock_x_ticker].dropna()
y_data = stock_data['Close'][stock_y_ticker].dropna()

# Align the indices
common_index = x_data.index.intersection(y_data.index)
x_data = x_data.loc[common_index]
y_data = y_data.loc[common_index]

# Perform Simple Linear Regression
slr = sm.OLS(y_data, sm.add_constant(x_data)).fit()
slr_prediction = slr.params[0] + slr.params[1] * x_data

print(f'SLR R-squared: {slr.rsquared}')
print(f'SLR Adjusted R-squared: {slr.rsquared_adj}')

plt.figure(figsize=(10, 6))
plt.scatter(x_data, y_data, label='Actual Data')
plt.plot(x_data, slr_prediction, color='red', label='Fitted Line')
plt.xlabel(stock_x_ticker)
plt.ylabel(stock_y_ticker)
plt.title(f'Simple Linear Regression between {stock_x_ticker} and {stock_y_ticker}')
plt.legend()
plt.show()
SLR R-squared: 0.4332853110364281
SLR Adjusted R-squared: 0.432882814808471
No description has been provided for this image

Multiple Linear Regression¶

  • From the MLR output, R-square of 0.915 indicates that approximately 91.5% of the variance in Adobe's (ADBE) closing prices can be explained by the model using the selected basket of 14 stocks as predictor variables.
  • The high F-statistic (1067) and the associated p-value of 0 indicate that the overall regression model is statistically significant and the model's predictors are related to the ADBE closing prices.
  • The high Condition Number of 11,600 suggests that there might be multicollinearity among the predictor variables.
  • The low Durbin-Watson statistic suggests potential autocorrelation in the residuals, which may indicate that some time-dependent structure is not captured by the model.
InĀ [1175]:
# Define the target variable (ADBE)
target_stock = 'ADBE'
y_data = closing_prices[target_stock].dropna()

# Define the predictor variables (all other stocks)
x_data = closing_prices.drop(columns=[target_stock]).dropna()

# Align the indices
common_index = y_data.index.intersection(x_data.index)
x_data = x_data.loc[common_index]
y_data = y_data.loc[common_index]

# Perform Multiple Linear Regression
X = sm.add_constant(np.column_stack((AAPL_close, MSFT_close, NVDA_close, AVGO_close, ORCL_close, CRM_close, AMD_close, ACN_close, CSCO_close, BRKB_close, JPM_close, V_close, MA_close, BAC_close)))
mlr = sm.OLS(y_data, X).fit()

print(f'MLR ADBE and Filtered Basket - R-squared: {mlr.rsquared}')
print(f'MLR ADBE and Filtered Basket - Adjusted R-squared: {mlr.rsquared_adj}')
print(mlr.summary())

# Predict using the MLR model
mlr_prediction = mlr.predict(X)

plt.figure(figsize=(12, 8))
plt.plot(y_data, label='ADBE Actual')
plt.plot(y_data.index, mlr_prediction, label='MLR Prediction', color='red')
plt.xlabel('Date')
plt.ylabel('ADBE Closing Price')
plt.title(f'Multiple Linear Regression for ADBE using Filtered Stock Basket')
plt.legend()
plt.show()
MLR ADBE and Filtered Basket - R-squared: 0.9147368952268922
MLR ADBE and Filtered Basket - Adjusted R-squared: 0.9138799796010317
                            OLS Regression Results                            
==============================================================================
Dep. Variable:                   ADBE   R-squared:                       0.915
Model:                            OLS   Adj. R-squared:                  0.914
Method:                 Least Squares   F-statistic:                     1067.
Date:                Fri, 09 Aug 2024   Prob (F-statistic):               0.00
Time:                        04:06:40   Log-Likelihood:                -6948.9
No. Observations:                1408   AIC:                         1.393e+04
Df Residuals:                    1393   BIC:                         1.401e+04
Df Model:                          14                                         
Covariance Type:            nonrobust                                         
==============================================================================
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
const        -91.9516     14.773     -6.224      0.000    -120.931     -62.972
x1             0.7230      0.093      7.806      0.000       0.541       0.905
x2             0.5874      0.096      6.094      0.000       0.398       0.777
x3             0.2788      0.245      1.139      0.255      -0.201       0.759
x4            -0.1793      0.225     -0.798      0.425      -0.620       0.262
x5             0.4350      0.205      2.122      0.034       0.033       0.837
x6             1.2542      0.052     24.334      0.000       1.153       1.355
x7            -0.8751      0.127     -6.875      0.000      -1.125      -0.625
x8             0.8828      0.087     10.182      0.000       0.713       1.053
x9             2.3370      0.294      7.955      0.000       1.761       2.913
x10           -1.3736      0.068    -20.059      0.000      -1.508      -1.239
x11            0.4377      0.160      2.727      0.006       0.123       0.752
x12            0.8918      0.200      4.465      0.000       0.500       1.284
x13           -0.2623      0.108     -2.423      0.016      -0.475      -0.050
x14           -1.6687      0.480     -3.479      0.001      -2.610      -0.728
==============================================================================
Omnibus:                       34.354   Durbin-Watson:                   0.059
Prob(Omnibus):                  0.000   Jarque-Bera (JB):               40.072
Skew:                          -0.322   Prob(JB):                     1.99e-09
Kurtosis:                       3.518   Cond. No.                     1.16e+04
==============================================================================

Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
[2] The condition number is large, 1.16e+04. This might indicate that there are
strong multicollinearity or other numerical problems.
No description has been provided for this image

Support Vector Regression (SVR)¶

  • The SVR model using AAPL's closing prices provides a moderate prediction of ADBE's prices based on the R-square values. The model only explains 52.6% of the variance on the training set and 49.1% on the test set. The close MSE and MAE values between the training and test sets suggest that the model does not suffer from severe overfitting and generalizes relatively well to unseen data.
  • Using the basket of selected stocks significantly improves the SVR model's ability to predict ADBE's closing prices compared to using a single stock like AAPL. The basket model explains 79.7% of the variance on the training set and 78.7% on the test set. The lower MSE ands MAE values compared to the single-stock model suggests improved accuracy in the model's predictions when using the basket of stocks.
InĀ [546]:
# Define the target variable (ADBE)
target_stock = 'ADBE'
y_data = closing_prices[target_stock].dropna()

# Define the predictor variables (all other stocks)
x_data = closing_prices.drop(columns=[target_stock]).dropna()

# Align the indices
common_index = y_data.index.intersection(x_data.index)
x_data = x_data.loc[common_index]
y_data = y_data.loc[common_index]

# Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(x_data, y_data, test_size=0.2, random_state=42)

# SVR with one stock (e.g., AAPL)
svr_model_single = make_pipeline(StandardScaler(), SVR(kernel='rbf', C=1.0, epsilon=0.2))
X_train_single = X_train[['AAPL']]
X_test_single = X_test[['AAPL']]
svr_model_single.fit(X_train_single, y_train)
y_train_pred_single = svr_model_single.predict(X_train_single)
y_test_pred_single = svr_model_single.predict(X_test_single)

# Evaluate the single stock model
train_mse_single = mean_squared_error(y_train, y_train_pred_single)
test_mse_single = mean_squared_error(y_test, y_test_pred_single)
train_mae_single = mean_absolute_error(y_train, y_train_pred_single)
test_mae_single = mean_absolute_error(y_test, y_test_pred_single)
train_r2_single = r2_score(y_train, y_train_pred_single)
test_r2_single = r2_score(y_test, y_test_pred_single)

print('SVR with Single Stock (AAPL):')
print(f'Train MSE: {train_mse_single}')
print(f'Test MSE: {test_mse_single}')
print(f'Train MAE: {train_mae_single}')
print(f'Test MAE: {test_mae_single}')
print(f'Train R-squared: {train_r2_single}')
print(f'Test R-squared: {test_r2_single}')
print()
SVR with Single Stock (AAPL):
Train MSE: 6351.090442650765
Test MSE: 6545.172635117789
Train MAE: 59.08351976513713
Test MAE: 59.8118066194894
Train R-squared: 0.5258430089690345
Test R-squared: 0.49137295508659706

InĀ [548]:
# Plot actual vs predicted values for testing sets for single stock (AAPL)
plt.figure(figsize=(14, 7))
plt.plot(y_test.index, y_test, label='Actual Test ADBE')
plt.plot(y_test.index, y_test_pred_single, label='SVR Prediction (AAPL)', alpha=0.7)
plt.xlabel('Date')
plt.ylabel('ADBE Closing Price')
plt.title('SVR for ADBE using AAPL')
plt.legend()
plt.show()
No description has been provided for this image
InĀ [550]:
# SVR with the entire basket of stocks
svr_model_basket = make_pipeline(StandardScaler(), SVR(kernel='rbf', C=1.0, epsilon=0.2))
svr_model_basket.fit(X_train, y_train)
y_train_pred_basket = svr_model_basket.predict(X_train)
y_test_pred_basket = svr_model_basket.predict(X_test)

# Evaluate the basket model
train_mse_basket = mean_squared_error(y_train, y_train_pred_basket)
test_mse_basket = mean_squared_error(y_test, y_test_pred_basket)
train_mae_basket = mean_absolute_error(y_train, y_train_pred_basket)
test_mae_basket = mean_absolute_error(y_test, y_test_pred_basket)
train_r2_basket = r2_score(y_train, y_train_pred_basket)
test_r2_basket = r2_score(y_test, y_test_pred_basket)

print('SVR with Basket of Stocks:')
print(f'Train MSE: {train_mse_basket}')
print(f'Test MSE: {test_mse_basket}')
print(f'Train MAE: {train_mae_basket}')
print(f'Test MAE: {test_mae_basket}')
print(f'Train R-squared: {train_r2_basket}')
print(f'Test R-squared: {test_r2_basket}')
print()
SVR with Basket of Stocks:
Train MSE: 2716.2138820025752
Test MSE: 2740.6975790822753
Train MAE: 39.99276524814664
Test MAE: 40.35452036488289
Train R-squared: 0.797214066951731
Test R-squared: 0.7870196878886682

InĀ [552]:
# Plot actual vs predicted values for testing sets for basket of stocks
plt.figure(figsize=(14, 7))
plt.plot(y_test.index, y_test, label='Actual Test ADBE')
plt.plot(y_test.index, y_test_pred_basket, label='SVR Prediction (Basket)', alpha=0.7)
plt.xlabel('Date')
plt.ylabel('ADBE Closing Price')
plt.title('SVR for ADBE using Basket of Stocks')
plt.legend()
plt.show()
No description has been provided for this image
InĀ [1187]:
sns.pairplot(closing_prices)
plt.title('Pair Plot of Stock Prices')
plt.show()
No description has been provided for this image

Cointegration¶

9 pairs of stocks were identified as cointegrated suggesting potential trading opportunities where the price movements of these stocks may be predictably related over the long term. These pairs can be useful for pairs trading strategies

  • AMD and MA appear in the most pairs, indicating that they have significant cointegrated relationships with several other stocks.
  • MLR and Dickey-Fuller tests were performed on cointegrated pairs. The R-squared values for the MLR models range from 64.9% to 97.4% suggesting a strong linear relationship between the pairs.
  • The ADF statistics for all pairs are negative, and the p-values are all below 0.05 providing strong evidence of cointegration between the pairs.
InĀ [563]:
# Cointegration analysis
def find_cointegrated_pairs(data):
    n = data.shape[1]
    score_matrix = np.zeros((n, n))
    pvalue_matrix = np.ones((n, n))
    keys = data.keys()
    pairs = []
    for i in range(n):
        for j in range(i + 1, n):
            S1 = data[keys[i]]
            S2 = data[keys[j]]
            result = coint(S1, S2)
            score = result[0]
            pvalue = result[1]
            score_matrix[i, j] = score
            pvalue_matrix[i, j] = pvalue
            if pvalue < 0.05:
                pairs.append((keys[i], keys[j]))
    return score_matrix, pvalue_matrix, pairs
InĀ [567]:
# Perform cointegration analysis
scores, pvalues, pairs = find_cointegrated_pairs(closing_prices)

# Plot the heatmap
plt.figure(figsize=(10, 8))
sns.heatmap(pvalues, xticklabels=stocks, yticklabels=stocks, cmap='RdYlGn_r', mask=(pvalues >= 0.99))
plt.title('Cointegration P-Values Heatmap')
plt.show()
No description has been provided for this image
InĀ [569]:
# Print the cointegrated pairs
print("Cointegrated pairs:")
print(pairs)
Cointegrated pairs:
[('ADBE', 'CRM'), ('AMD', 'MA'), ('AMD', 'MSFT'), ('AMD', 'V'), ('MA', 'MSFT'), ('MA', 'NVDA'), ('MA', 'ORCL'), ('MA', 'V'), ('MSFT', 'V')]
InĀ [1198]:
# Function to perform MLR and Dickey-Fuller test on cointegrated pairs
def basket_testing(cointegrated_pairs, data):
    results = []
    for pair in cointegrated_pairs:
        stock1, stock2 = pair
        y = data[stock1]
        X = data[stock2]
        X = sm.add_constant(X)  # Adding a constant term to the model
        
        # Perform MLR
        model = sm.OLS(y, X).fit()
        residuals = model.resid
        
        # Perform Dickey-Fuller test on the residuals
        adf_result = adfuller(residuals)
        
        # Store relevant results
        results.append({
            'Pair': pair,
            'MLR R-squared': model.rsquared,
            'ADF Statistic': adf_result[0],
            'ADF p-value': adf_result[1]
        })
        
    return results

# Example usage:
results = basket_testing(pairs, closing_prices)

# Print the results for each pair
for result in results:
    print(f"Pair: {result['Pair']}")
    print(f"MLR R-squared: {result['MLR R-squared']:.4f}")
    print(f"ADF Statistic: {result['ADF Statistic']:.4f}")
    print(f"ADF p-value: {result['ADF p-value']:.4f}")
    print("\n" + "="*50 + "\n")
Pair: ('ADBE', 'CRM')
MLR R-squared: 0.7713
ADF Statistic: -3.3803
ADF p-value: 0.0116

==================================================

Pair: ('AMD', 'MA')
MLR R-squared: 0.7973
ADF Statistic: -3.8978
ADF p-value: 0.0021

==================================================

Pair: ('AMD', 'MSFT')
MLR R-squared: 0.9189
ADF Statistic: -4.0061
ADF p-value: 0.0014

==================================================

Pair: ('AMD', 'V')
MLR R-squared: 0.7817
ADF Statistic: -3.8457
ADF p-value: 0.0025

==================================================

Pair: ('MA', 'MSFT')
MLR R-squared: 0.8628
ADF Statistic: -4.4840
ADF p-value: 0.0002

==================================================

Pair: ('MA', 'NVDA')
MLR R-squared: 0.6493
ADF Statistic: -3.5706
ADF p-value: 0.0063

==================================================

Pair: ('MA', 'ORCL')
MLR R-squared: 0.7647
ADF Statistic: -3.7994
ADF p-value: 0.0029

==================================================

Pair: ('MA', 'V')
MLR R-squared: 0.9740
ADF Statistic: -4.6446
ADF p-value: 0.0001

==================================================

Pair: ('MSFT', 'V')
MLR R-squared: 0.8567
ADF Statistic: -3.4441
ADF p-value: 0.0095

==================================================

The plots for the actual price of the target stocks against the predicted prices from the MLR model for each cointegrated pair are shown below.

InĀ [1208]:
# Function to perform MLR, plot actual vs predicted prices
def plot_mlr_vs_actual_in_grids(cointegrated_pairs, data):
    num_pairs = len(cointegrated_pairs)
    num_rows = (num_pairs // 3) + (1 if num_pairs % 3 != 0 else 0)
    
    fig, axes = plt.subplots(num_rows, 3, figsize=(18, num_rows * 6))
    axes = axes.flatten()

    for i, pair in enumerate(cointegrated_pairs):
        stock1, stock2 = pair
        y = data[stock1]
        X = data[stock2]
        X = sm.add_constant(X)  # Adding a constant term to the model
        
        # Perform MLR
        model = sm.OLS(y, X).fit()
        mlr_prediction = model.predict(X)
        mlr_prediction.name = f"{stock1}_predicted_by_{stock2}"
        
        # Plot actual vs predicted prices
        pd.concat([y, mlr_prediction], axis=1).plot(ax=axes[i])
        axes[i].set_title(f"{stock1} vs {mlr_prediction.name}")
        axes[i].set_ylabel('Price Levels')
        axes[i].legend([stock1, mlr_prediction.name])
    
    # Hide any unused subplots
    for j in range(i + 1, len(axes)):
        fig.delaxes(axes[j])
    
    plt.tight_layout()
    plt.show()

plot_mlr_vs_actual_in_grids(pairs, closing_prices)
No description has been provided for this image

Portfolio Asset Allocation¶

For conducting portfolio analysis on the 15 filtered stocks, historical data of various asset classes such as indices and bonds have been added. The adjusted Close prices for stocks, bonds have been plotted below

  • OLS regression analysis is conducted between the excess return of the stock basket over the risk-free rate (dependent variable) and the market return (independent variable represented by SPY ETF).
    • A very high R-square of 91.7% suggests a strong relationship between the portfolio and the market
    • A very high F-statistic of 13780 indicates that the overall regression model is statistically significant
    • SPY cooeffiecint of 1.2009 is the beta of the portfolio which indicates that the portfolio is 20% more volatile than the market
InĀ [1236]:
stock_data = yf.download(['AAPL', 'MSFT', 'NVDA', 'AVGO', 'ORCL', 'CRM', 'ADBE', 'AMD', 'ACN', 'CSCO', 'BRK-B', 'JPM', 'V', 'MA', 'BAC'], start="2019-01-01")
index_data = yf.download(['SPY', '^GSPC','^DJI','^IXIC','^RUT', 'BIL'], start="2019-01-01")
bond_data = yf.download(['^IRX', '^FVX', '^TNX', 'SHY', 'LQD', 'BND', 'GOVT'], start="2019-01-01")
# etf_data = yf.download(['IVV','QQQ','XLV','VEA','EEM','FEZ','GLD','VNQ','ARKK'], start="2019-01-01")
[*********************100%%**********************]  15 of 15 completed
[*********************100%%**********************]  6 of 6 completed
[*********************100%%**********************]  7 of 7 completed
InĀ [1228]:
# Process stock data
prices_yf_adj_close = stock_data['Adj Close']
prices_yf_adj_close = prices_yf_adj_close.dropna()

# Process index data
prices_indx_close = index_data['Close']
prices_indx_close = prices_indx_close.dropna()

# Process bond data
USbond_yf_adj_close = bond_data['Adj Close']
USbond_yf_adj_close = USbond_yf_adj_close.dropna()
InĀ [1230]:
# Plot stock prices
plt.figure(figsize=(14, 7))
for c in prices_yf_adj_close.columns.values:
    plt.plot(prices_yf_adj_close.index, prices_yf_adj_close[c], lw=3, alpha=0.8, label=c)
plt.legend(loc='upper left', fontsize=12)
plt.ylabel('Price in $')
plt.title('Adjusted Close Prices of Stocks')
plt.show()
No description has been provided for this image
InĀ [1232]:
# Plot bond prices
plt.figure(figsize=(14, 7))
for c in USbond_yf_adj_close.columns.values:
    plt.plot(USbond_yf_adj_close.index, USbond_yf_adj_close[c], lw=3, alpha=0.8, label=c)
plt.legend(loc='upper left', fontsize=12)
plt.ylabel('Price in $')
plt.title('Adjusted Close Prices of Bonds')
plt.show()
No description has been provided for this image
InĀ [1005]:
# Define training and test date ranges
tr_start_date = '2019-01-01'
tr_end_date = '2023-12-31'
te_start_date = '2024-01-01'
te_end_date = '2024-08-01'

# Split the data into training and test sets
prices_yf_adj_close_train = prices_yf_adj_close.loc[tr_start_date:tr_end_date]
prices_yf_adj_close_test = prices_yf_adj_close.loc[te_start_date:te_end_date]

USbond_yf_adj_close_train = USbond_yf_adj_close.loc[tr_start_date:tr_end_date]
USbond_yf_adj_close_test = USbond_yf_adj_close.loc[te_start_date:te_end_date]

prices_indx_close_train = prices_indx_close.loc[tr_start_date:tr_end_date]
prices_indx_close_test = prices_indx_close.loc[te_start_date:te_end_date]

# Calculate returns
Returns = prices_yf_adj_close_train.pct_change().dropna()

# Calculate equally weighted returns
weights = np.ones(Returns.shape[1]) * 1 / Returns.shape[1]
df_wgts = pd.DataFrame(weights, index=Returns.columns, columns=['Weights'])

# Determine portfolio return based on equal weighted allocation
Basket_Ret = Returns.dot(df_wgts)

# Put results back into Return matrix
Returns['Basket_Ret'] = Basket_Ret

# Portfolio return for training period
Basket_R_train = Returns['Basket_Ret']

# Risk-free proxy
R_F = prices_indx_close_train['BIL'].pct_change()[1:]

# Market returns
M = prices_indx_close_train['SPY'].pct_change()[1:]

# Adjust the returns by subtracting the risk-free rate
Y = Basket_R_train - R_F.loc[Basket_R_train.index]

# Add a constant (intercept) to the explanatory variables
X = sm.add_constant(M.loc[Basket_R_train.index])

# Multiple Linear Regression
Basket_train_results = regression.linear_model.OLS(Y, X).fit()

# Beta of the basket
Basket_beta_train = Basket_train_results.params[1]

# Plot returns
plt.figure(figsize=(14, 7))
M.plot(label='Market (SPY)')
Basket_R_train.plot(label='Stock Basket Return')
R_F.plot(label='Risk-Free (BIL)')
plt.xlabel('Time')
plt.ylabel('Daily Percent Return')
plt.legend()
plt.show()

# Print regression summary
print(Basket_train_results.summary())
No description has been provided for this image
                            OLS Regression Results                            
==============================================================================
Dep. Variable:                      y   R-squared:                       0.917
Model:                            OLS   Adj. R-squared:                  0.916
Method:                 Least Squares   F-statistic:                 1.378e+04
Date:                Thu, 08 Aug 2024   Prob (F-statistic):               0.00
Time:                        17:58:08   Log-Likelihood:                 4925.2
No. Observations:                1257   AIC:                            -9846.
Df Residuals:                    1255   BIC:                            -9836.
Df Model:                           1                                         
Covariance Type:            nonrobust                                         
==============================================================================
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
const          0.0004      0.000      2.964      0.003       0.000       0.001
SPY            1.2009      0.010    117.372      0.000       1.181       1.221
==============================================================================
Omnibus:                      172.522   Durbin-Watson:                   1.904
Prob(Omnibus):                  0.000   Jarque-Bera (JB):              676.467
Skew:                           0.611   Prob(JB):                    1.28e-147
Kurtosis:                       6.380   Cond. No.                         75.4
==============================================================================

Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.

The predicted returns and the predicted cumulative returns closely trail the actual return and the actual cumulative returns of stock Basket which is a good indicatior for our portfolio.

InĀ [683]:
# Prediction for the test period
predictions = R_F_test + Basket_beta_train * (M_test - R_F_test)  # CAPM equation

# Plot the predictions against actual returns
plt.figure(figsize=(14, 7))
predictions.plot(label='Prediction')
Basket_R_test.plot(label='Actual Return', color='y')
plt.legend(['Prediction', 'Actual Return'])
plt.xlabel('Time')
plt.ylabel('Daily Percent Return')
plt.title('Predicted vs Actual Returns of Stock Basket')
plt.show()
No description has been provided for this image
InĀ [685]:
# Calculate cumulative returns
predictions_cum = predictions.cumsum() + 1
Basket_cum = Basket_R_test.cumsum() + 1

# Plot the cumulative returns
plt.figure(figsize=(14, 7))
predictions_cum.plot(label='Prediction_cum')
Basket_cum.plot(label='Actual Return_cum', color='b')
plt.legend(['Prediction_cum', 'Actual Return_cum'])
plt.xlabel('Time')
plt.ylabel('Cumulative Return')
plt.title('Predicted vs Actual Cumulative Returns of Stock Basket')
plt.show()
No description has been provided for this image
  • The Maximum Sharpe Ratio Portfolio below has a sharpe ratio of 0.76, annualized return of 29% and annualized volatity of 22%, which gives us the highest return with a significant allocation to both high-growth stocks and some bond ETFs to balance risk.
  • The Minimum Volatility Portfolio has an annualized return of 16% and annualized volatity of 16% and focuses on minimizing risk by allocating more towards bonds and stable stocks.
InĀ [1275]:
def portfolio_annualised_performance(weights, mean_returns, cov_matrix):
  returns = np.sum(mean_returns*weights) * 252 # Number of trading days in the year
  std = np.sqrt(np.dot(weights.T, np.dot(cov_matrix, weights))) * np.sqrt(252)
  return std, returns
InĀ [1277]:
def random_portfolios(num_portfolios, returns, cov_matrix, risk_free_rate):
  results = np.zeros((3,num_portfolios))
  weights_record = []
  mean_returns = returns.mean()
  for i in range(num_portfolios):
      weights = np.random.random(returns.shape[1])
      weights /= np.sum(weights)
      weights_record.append(weights)
      portfolio_std_dev, portfolio_return = portfolio_annualised_performance(weights, mean_returns, cov_matrix)
      results[0,i] = portfolio_std_dev
      results[1,i] = portfolio_return
      results[2,i] = (portfolio_return - risk_free_rate) / portfolio_std_dev # Sharpe Ratio
  return results, weights_record
InĀ [1279]:
fyprices_yf_adj_close = prices_yf_adj_close.loc['2019-01-01':tr_end_date]
fyUSbond_yf_adj_close = USbond_yf_adj_close.loc['2019-01-01':tr_end_date]
InĀ [1281]:
# Equity Returns
EQreturns = fyprices_yf_adj_close.pct_change()
EQreturns = EQreturns.dropna()
mean_EQreturns = EQreturns.mean()
EQcov_matrix = EQreturns.cov()

# Money Market Returns
MMreturns = fyUSbond_yf_adj_close.pct_change()
MMreturns = MMreturns.dropna()
mean_MMreturns = MMreturns.mean()
MMcov_matrix = MMreturns.cov()

# Concatenating returns
# Place the DataFrames side by side
PortfolioRet = pd.concat([EQreturns, MMreturns], axis=1)
mean_PortfolioRet = PortfolioRet.mean()
cov_matrix = PortfolioRet.cov()

num_portfolios = 25000
risk_free_rate = 0.01
InĀ [1283]:
def display_simulated_ef_with_random(returns, cov_matrix, num_portfolios, risk_free_rate):
    results, weights = random_portfolios(num_portfolios,returns, cov_matrix, risk_free_rate)

    max_sharpe_idx = np.argmax(results[2])
    print(results[2][max_sharpe_idx])
    sdp, rp = results[0,max_sharpe_idx], results[1,max_sharpe_idx]
    max_sharpe_allocation = pd.DataFrame(weights[max_sharpe_idx],index=returns.columns,columns=['allocation'])
    max_sharpe_allocation.allocation = [round(i*100,2)for i in max_sharpe_allocation.allocation]
    max_sharpe_allocation = max_sharpe_allocation.T

    min_vol_idx = np.argmin(results[0])
    sdp_min, rp_min = results[0,min_vol_idx], results[1,min_vol_idx]
    min_vol_allocation = pd.DataFrame(weights[min_vol_idx],index=returns.columns,columns=['allocation'])
    min_vol_allocation.allocation = [round(i*100,2)for i in min_vol_allocation.allocation]
    min_vol_allocation = min_vol_allocation.T

    print ("-"*80)
    print ("Maximum Sharpe Ratio Portfolio Allocation\n")
    print ("Annualised Return:", round(rp,2))
    print ("Annualised Volatility:", round(sdp,2))
    print ("\n")
    print (max_sharpe_allocation)
    print ("-"*80)
    print ("Minimum Volatility Portfolio Allocation\n")
    print ("Annualised Return:", round(rp_min,2))
    print ("Annualised Volatility:", round(sdp_min,2))
    print ("\n")
    print (min_vol_allocation)

    plt.figure(figsize=(15, 10))
    plt.scatter(results[0,:],results[1,:],c=results[2,:],cmap='YlGnBu', marker='o', s=10, alpha=0.3)
    plt.colorbar()
    plt.scatter(sdp,rp,marker='*',color='r',s=500, label='Maximum Sharpe ratio')
    plt.scatter(sdp_min,rp_min,marker='*',color='g',s=500, label='Minimum volatility')
    plt.title('Simulated Portfolio Optimization based on Efficient Frontier')
    plt.xlabel('annualised volatility')
    plt.ylabel('annualised returns')
    plt.legend(labelspacing=0.8)
    return results, rp , sdp, rp_min, sdp_min
InĀ [1285]:
results, rp , sdp, rp_min, sdp_min = display_simulated_ef_with_random(PortfolioRet, cov_matrix, num_portfolios, risk_free_rate)
1.2015685257156334
--------------------------------------------------------------------------------
Maximum Sharpe Ratio Portfolio Allocation

Annualised Return: 0.28
Annualised Volatility: 0.22


Ticker      AAPL   ACN  ADBE   AMD  AVGO   BAC  BRK-B   CRM  CSCO   JPM  ...  \
allocation  7.01  9.48   1.5  7.16  6.85  6.05   2.71  0.77  0.67  7.68  ...   

Ticker      NVDA  ORCL     V  BND  GOVT   LQD   SHY  ^FVX  ^IRX  ^TNX  
allocation  9.25  2.31  0.16  5.6  6.19  1.06  8.66  7.81  0.48  1.57  

[1 rows x 22 columns]
--------------------------------------------------------------------------------
Minimum Volatility Portfolio Allocation

Annualised Return: 0.18
Annualised Volatility: 0.16


Ticker      AAPL   ACN  ADBE   AMD  AVGO   BAC  BRK-B   CRM  CSCO   JPM  ...  \
allocation  6.02  3.23  0.61  0.16  4.36  1.27   0.31  2.91  3.92  9.01  ...   

Ticker      NVDA  ORCL     V   BND  GOVT    LQD   SHY  ^FVX  ^IRX  ^TNX  
allocation  4.99  2.53  2.34  9.84  10.6  11.45  8.46  1.02  0.57  2.58  

[1 rows x 22 columns]
No description has been provided for this image
InĀ [1288]:
# exclude 13 Wk treasury Bills , as very volatile
exclude2 = ['^IRX']
#exclude2 = []
fyUSbond_yf_adj_close = fyUSbond_yf_adj_close.loc[:, fyUSbond_yf_adj_close.columns.difference(exclude2)]
InĀ [1290]:
# Redo Equity Returns
EQreturns = fyprices_yf_adj_close.pct_change()
mean_EQreturns = EQreturns.mean()
EQcov_matrix = EQreturns.cov()

# Money Market Returns
MMreturns = fyUSbond_yf_adj_close.pct_change()
mean_MMreturns = MMreturns.mean()
MMcov_matrix = MMreturns.cov()

# Concatenating returns
# Place the DataFrames side by side
PortfolioRet = pd.concat([EQreturns, MMreturns], axis=1)
mean_PortfolioRet = PortfolioRet.mean()
cov_matrix = PortfolioRet.cov()
InĀ [1292]:
results, rp , sdp, rp_min, sdp_min = display_simulated_ef_with_random(PortfolioRet, cov_matrix, num_portfolios, risk_free_rate)
1.2285121027596466
--------------------------------------------------------------------------------
Maximum Sharpe Ratio Portfolio Allocation

Annualised Return: 0.29
Annualised Volatility: 0.22


Ticker      AAPL   ACN  ADBE  AMD  AVGO   BAC  BRK-B   CRM  CSCO   JPM  ...  \
allocation   9.4  4.46  3.86  9.7  6.59  0.45   1.47  2.11  0.43  5.99  ...   

Ticker      MSFT   NVDA  ORCL     V   BND  GOVT  LQD   SHY  ^FVX  ^TNX  
allocation  1.09  10.37  3.25  1.24  7.61  3.73  5.7  7.73  5.54  3.83  

[1 rows x 21 columns]
--------------------------------------------------------------------------------
Minimum Volatility Portfolio Allocation

Annualised Return: 0.16
Annualised Volatility: 0.16


Ticker      AAPL  ACN  ADBE   AMD  AVGO   BAC  BRK-B   CRM  CSCO   JPM  ...  \
allocation  6.21  3.5  0.46  1.08  0.01  1.48   9.18  4.19  7.11  4.91  ...   

Ticker      MSFT  NVDA  ORCL     V   BND  GOVT   LQD   SHY  ^FVX  ^TNX  
allocation  7.48  0.87  8.27  2.79  8.13  7.67  7.26  9.32  1.45  4.53  

[1 rows x 21 columns]
No description has been provided for this image

Efficient Frontier¶

The Efficient Frontier below helps identify the best possible portfolios that can be constructed with our basket. We simulate multiple random portfolios and calculate their risk and return characteristics. The code then visualizes the Efficient Frontier along with the simulated portfolios, highlighting the optimal portfolios

  • It's observed that while the annualized return has dropped from 29% to 11%, the sharpe ratio has also reduced from 0.76 to 0.64 and the annualized volatity has drastically reduced from 22% to 7%
  • however 34% allaocation has been given to BOND which I may tweak lower to increase the annualized returns as part of my investment strategy.
InĀ [1295]:
def neg_sharpe_ratio(weights, mean_returns, cov_matrix, risk_free_rate):
    p_var, p_ret = portfolio_annualised_performance(weights, mean_returns, cov_matrix)
    return -(p_ret - risk_free_rate) / p_var

def max_sharpe_ratio(mean_returns, cov_matrix, risk_free_rate):
    num_assets = len(mean_returns)
    args = (mean_returns, cov_matrix, risk_free_rate)
    constraints = ({'type': 'eq', 'fun': lambda x: np.sum(x) - 1})
    bound = (0.0,1.0)
    bounds = tuple(bound for asset in range(num_assets))
    result = sco.minimize(neg_sharpe_ratio, num_assets*[1./num_assets,], args=args,
                        method='SLSQP', bounds=bounds, constraints=constraints)
    return result
InĀ [1297]:
def portfolio_volatility(weights, mean_returns, cov_matrix):
    return portfolio_annualised_performance(weights, mean_returns, cov_matrix)[0]

def min_variance(mean_returns, cov_matrix):
    num_assets = len(mean_returns)
    args = (mean_returns, cov_matrix)
    constraints = ({'type': 'eq', 'fun': lambda x: np.sum(x) - 1})
    bound = (0.0,1.0)
    bounds = tuple(bound for asset in range(num_assets))

    result = sco.minimize(portfolio_volatility, num_assets*[1./num_assets,], args=args,
                        method='SLSQP', bounds=bounds, constraints=constraints)

    return result
InĀ [1299]:
def efficient_return(mean_returns, cov_matrix, target):
    num_assets = len(mean_returns)
    args = (mean_returns, cov_matrix)

    def portfolio_return(weights):
        return portfolio_annualised_performance(weights, mean_returns, cov_matrix)[1]

    constraints = ({'type': 'eq', 'fun': lambda x: portfolio_return(x) - target},
                   {'type': 'eq', 'fun': lambda x: np.sum(x) - 1})
    bounds = tuple((0,1) for asset in range(num_assets))
    result = sco.minimize(portfolio_volatility, num_assets*[1./num_assets,], args=args, method='SLSQP', bounds=bounds, constraints=constraints)
    #print(result)
    return result


def efficient_frontier(mean_returns, cov_matrix, returns_range):
    efficients = []
    for ret in returns_range:
        efficients.append(efficient_return(mean_returns, cov_matrix, ret))
    return efficients
InĀ [1301]:
def display_calculated_ef_with_random(returns, cov_matrix, num_portfolios, risk_free_rate):
    results, _ = random_portfolios(num_portfolios,returns, cov_matrix, risk_free_rate)

    mean_returns = returns.mean()
    max_sharpe = max_sharpe_ratio(mean_returns, cov_matrix, risk_free_rate)
    sdp, rp = portfolio_annualised_performance(max_sharpe['x'], mean_returns, cov_matrix)
    max_sharpe_allocation = pd.DataFrame(max_sharpe.x,index=returns.columns,columns=['allocation'])
    max_sharpe_allocation.allocation = [round(i*100,2)for i in max_sharpe_allocation.allocation]
    max_sharpe_allocation = max_sharpe_allocation.T
    max_sharpe_allocation

    min_vol = min_variance(mean_returns, cov_matrix)
    sdp_min, rp_min = portfolio_annualised_performance(min_vol['x'], mean_returns, cov_matrix)
    min_vol_allocation = pd.DataFrame(min_vol.x,index=returns.columns,columns=['allocation'])
    min_vol_allocation.allocation = [round(i*100,2)for i in min_vol_allocation.allocation]
    min_vol_allocation = min_vol_allocation.T

    print ("-"*80)
    print ("Maximum Sharpe Ratio Portfolio Allocation\n")
    print ("Annualised Return:", round(rp,2))
    print ("Annualised Volatility:", round(sdp,2))
    print ("\n")
    print (max_sharpe_allocation)
    print ("-"*80)
    print ("Minimum Volatility Portfolio Allocation\n")
    print ("Annualised Return:", round(rp_min,2))
    print ("Annualised Volatility:", round(sdp_min,2))
    print ("\n")
    print (min_vol_allocation)

    plt.figure(figsize=(15, 10))
    plt.scatter(results[0,:],results[1,:],c=results[2,:],cmap='YlGnBu', marker='o', s=20, alpha=0.3)
    plt.colorbar()
    plt.scatter(sdp,rp,marker='*',color='r',s=500, label='Maximum Sharpe ratio')
    plt.scatter(sdp_min,rp_min,marker='*',color='g',s=500, label='Minimum volatility')

    target = np.linspace(rp_min, 0.32, 50)
    efficient_portfolios = efficient_frontier(mean_returns, cov_matrix, target)
    plt.plot([p['fun'] for p in efficient_portfolios], target, linestyle='-.', color='black', label='efficient frontier')
    #plt.title('Calculated Portfolio Optimization based on Efficient Frontier')
    plt.title('Portfolio Optimization based on Efficient Frontier')
    plt.xlabel('annualised volatility')
    plt.ylabel('annualised returns')
    plt.legend(labelspacing=0.8)
InĀ [1303]:
display_calculated_ef_with_random(PortfolioRet, cov_matrix, num_portfolios, risk_free_rate)
--------------------------------------------------------------------------------
Maximum Sharpe Ratio Portfolio Allocation

Annualised Return: 0.11
Annualised Volatility: 0.07


Ticker      AAPL  ACN  ADBE   AMD  AVGO  BAC  BRK-B  CRM  CSCO  JPM  ...  \
allocation  6.35  0.0   0.0  0.11  1.85  0.0    0.0  0.0   0.0  0.0  ...   

Ticker      MSFT  NVDA  ORCL    V  BND   GOVT  LQD    SHY  ^FVX  ^TNX  
allocation   0.0   8.0   0.0  0.0  0.0  34.22  0.0  44.96   4.5   0.0  

[1 rows x 21 columns]
--------------------------------------------------------------------------------
Minimum Volatility Portfolio Allocation

Annualised Return: 0.02
Annualised Volatility: 0.02


Ticker      AAPL  ACN  ADBE  AMD  AVGO  BAC  BRK-B  CRM  CSCO  JPM  ...  MSFT  \
allocation   0.0  0.0   0.0  0.0   0.0  0.0    0.0  0.0   0.0  0.0  ...   0.0   

Ticker      NVDA  ORCL    V  BND  GOVT  LQD    SHY  ^FVX  ^TNX  
allocation   0.0   0.0  0.0  0.0   0.0  0.0  98.58  0.83   0.6  

[1 rows x 21 columns]
No description has been provided for this image
InĀ [1308]:
def display_ef_with_selected(returns, cov_matrix, risk_free_rate):

    mean_returns = returns.mean()
    max_sharpe = max_sharpe_ratio(mean_returns, cov_matrix, risk_free_rate)
    sdp, rp = portfolio_annualised_performance(max_sharpe['x'], mean_returns, cov_matrix)
    max_sharpe_allocation = pd.DataFrame(max_sharpe.x,index=returns.columns,columns=['allocation'])
    max_sharpe_allocation.allocation = [round(i*100,2)for i in max_sharpe_allocation.allocation]
    max_sharpe_allocation = max_sharpe_allocation.T
    max_sharpe_allocation

    min_vol = min_variance(mean_returns, cov_matrix)
    sdp_min, rp_min = portfolio_annualised_performance(min_vol['x'], mean_returns, cov_matrix)
    min_vol_allocation = pd.DataFrame(min_vol.x,index=returns.columns,columns=['allocation'])
    min_vol_allocation.allocation = [round(i*100,2)for i in min_vol_allocation.allocation]
    min_vol_allocation = min_vol_allocation.T

    an_vol = np.std(returns) * np.sqrt(252)
    an_rt = mean_returns * 252

    print ("-"*80)
    print ("Maximum Sharpe Ratio Portfolio Allocation\n")
    print ("Annualised Return:", round(rp,2))
    print ("Annualised Volatility:", round(sdp,2))
    print ("\n")
    print (max_sharpe_allocation)
    print ("-"*80)
    print ("Minimum Volatility Portfolio Allocation\n")
    print ("Annualised Return:", round(rp_min,2))
    print ("Annualised Volatility:", round(sdp_min,2))
    print ("\n")
    print (min_vol_allocation)
    print ("-"*80)
    print ("Individual Stock Returns and Volatility\n")
    for i, txt in enumerate(returns.columns):
        print (txt,":","annualised return",round(an_rt[i],2),", annualised volatility:",round(an_vol[i],2))
    print ("-"*80)

    fig, ax = plt.subplots(figsize=(22, 15))
    ax.scatter(an_vol,an_rt,marker='o',s=700)

    for i, txt in enumerate(returns.columns):
        ax.annotate(txt, (an_vol[i],an_rt[i]), xytext=(18,0), textcoords='offset points')

    ax.scatter(sdp,rp,marker='*',color='r',s=500, label='Maximum Sharpe ratio')
    ax.scatter(sdp_min,rp_min,marker='*',color='g',s=500, label='Minimum volatility')

    target = np.linspace(rp_min, 0.45, 70)
    efficient_portfolios = efficient_frontier(mean_returns, cov_matrix, target)
    ax.plot([p['fun'] for p in efficient_portfolios], target, linestyle='-.', color='black', label='efficient frontier')
    ax.set_title('Portfolio Optimization with Individual Stocks')
    ax.set_xlabel('annualised volatility')
    ax.set_ylabel('annualised returns')
    ax.legend(labelspacing=0.8)
InĀ [1310]:
display_ef_with_selected(PortfolioRet, cov_matrix, risk_free_rate)
--------------------------------------------------------------------------------
Maximum Sharpe Ratio Portfolio Allocation

Annualised Return: 0.11
Annualised Volatility: 0.07


Ticker      AAPL  ACN  ADBE   AMD  AVGO  BAC  BRK-B  CRM  CSCO  JPM  ...  \
allocation  6.35  0.0   0.0  0.11  1.85  0.0    0.0  0.0   0.0  0.0  ...   

Ticker      MSFT  NVDA  ORCL    V  BND   GOVT  LQD    SHY  ^FVX  ^TNX  
allocation   0.0   8.0   0.0  0.0  0.0  34.22  0.0  44.96   4.5   0.0  

[1 rows x 21 columns]
--------------------------------------------------------------------------------
Minimum Volatility Portfolio Allocation

Annualised Return: 0.02
Annualised Volatility: 0.02


Ticker      AAPL  ACN  ADBE  AMD  AVGO  BAC  BRK-B  CRM  CSCO  JPM  ...  MSFT  \
allocation   0.0  0.0   0.0  0.0   0.0  0.0    0.0  0.0   0.0  0.0  ...   0.0   

Ticker      NVDA  ORCL    V  BND  GOVT  LQD    SHY  ^FVX  ^TNX  
allocation   0.0   0.0  0.0  0.0   0.0  0.0  98.58  0.83   0.6  

[1 rows x 21 columns]
--------------------------------------------------------------------------------
Individual Stock Returns and Volatility

AAPL : annualised return 0.38 , annualised volatility: 0.32
ACN : annualised return 0.24 , annualised volatility: 0.28
ADBE : annualised return 0.26 , annualised volatility: 0.37
AMD : annualised return 0.56 , annualised volatility: 0.54
AVGO : annualised return 0.4 , annualised volatility: 0.37
BAC : annualised return 0.15 , annualised volatility: 0.36
BRK-B : annualised return 0.14 , annualised volatility: 0.22
CRM : annualised return 0.21 , annualised volatility: 0.38
CSCO : annualised return 0.1 , annualised volatility: 0.28
JPM : annualised return 0.19 , annualised volatility: 0.32
MA : annualised return 0.22 , annualised volatility: 0.32
MSFT : annualised return 0.32 , annualised volatility: 0.3
NVDA : annualised return 0.67 , annualised volatility: 0.52
ORCL : annualised return 0.23 , annualised volatility: 0.31
V : annualised return 0.18 , annualised volatility: 0.28
BND : annualised return 0.01 , annualised volatility: 0.07
GOVT : annualised return 0.01 , annualised volatility: 0.06
LQD : annualised return 0.03 , annualised volatility: 0.11
SHY : annualised return 0.01 , annualised volatility: 0.02
^FVX : annualised return 0.38 , annualised volatility: 0.78
^TNX : annualised return 0.27 , annualised volatility: 0.64
--------------------------------------------------------------------------------
No description has been provided for this image

Diversification Factor¶

We achive a Portfolio Diversification Factor (PDF) of 8.53 indicating there's a high level of diversification in the portfolio. The portfolio’s volatility is 11.7% of what it would be if the portfolio was not diversified at all. Thecombination of assets in our portfolio significantly reduces overall risk which is a positive indicator.

InĀ [795]:
def fetch_data(tickers, start_date, end_date=None):
    data = yf.download(tickers, start=start_date, end=end_date)['Adj Close']
    return data
InĀ [797]:
def calculate_diversification_factor(weights, correlation_matrix):
    # Ensure weights and correlation matrix are numpy arrays
    weights = np.array(weights)
    correlation_matrix = np.array(correlation_matrix)

    # Calculate portfolio variance
    portfolio_variance = np.dot(weights.T, np.dot(correlation_matrix, weights))

    # Calculate the weighted variance (assuming all variances are 1 for simplicity)
    weighted_variance = np.sum(weights ** 2)

    # Diversification factor
    diversification_factor = portfolio_variance / weighted_variance

    return diversification_factor
InĀ [799]:
# filtered stocks selected for analysis
tickers = ['AAPL', 'MSFT', 'NVDA', 'AVGO', 'ORCL', 'CRM', 'ADBE', 'AMD', 'ACN', 'CSCO', 'BRK-B', 'JPM', 'V', 'MA', 'BAC']
start_date = '2019-01-01'
weights = [1/len(tickers) for i in tickers]

# Fetch historical data
data = fetch_data(tickers, start_date)

# Calculate diversification factor
pdf = calculate_diversification_factor(weights, correlation_matrix)
print("Portfolio Diversification Factor:", pdf)
[*********************100%%**********************]  15 of 15 completed
Portfolio Diversification Factor: 8.526933322445922

Value at Risk; Conditional Value at Risk¶

Our constructed portfolio portfolio is at risk of losing up to 1.66% (Portfolio VaR Return) or USD 16,648.59 on 95% of trading days. However, in the worst 5% of cases, if losses do occur, the average loss could be 2.18% (Portfolio CVaR Return) or USD 21,823.51. We can conclude that our portfolio risk is minimal.

InĀ [867]:
def value_at_risk(value_invested, returns, weights, alpha=0.95, lookback_days=780):
    returns = returns.fillna(0.0)
    portfolio_returns = returns.iloc[-lookback_days:].dot(weights)
    var = np.percentile(portfolio_returns, (1 - alpha) * 100)
    return value_invested * var
InĀ [869]:
def cvar(value_invested, returns, weights, alpha=0.95, lookback_days=780):
    var = value_at_risk(value_invested, returns, weights, alpha, lookback_days=lookback_days)
    returns = returns.fillna(0.0)
    portfolio_returns = returns.iloc[-lookback_days:].dot(weights)
    var_pct_loss = var / value_invested
    cvar_value = value_invested * np.nanmean(portfolio_returns[portfolio_returns < var_pct_loss])
    return cvar_value
InĀ [877]:
tickers = ['AAPL', 'MSFT', 'NVDA', 'AVGO', 'ORCL', 'CRM', 'ADBE', 'AMD', 'ACN', 'CSCO', 'BRK-B', 'JPM', 'V', 'MA', 'BAC']
start_date = '2019-01-01'
value_invested = 1000000

# Fetch historical data
data = fetch_data(tickers, start_date)
Returns = data.pct_change().dropna()

# Initialize and normalize weights
weights = np.ones((len(tickers), 1))
weights = weights / np.sum(weights)

# Split the data into training and test sets
Ret_tr = Returns.loc[tr_start_date:tr_end_date]
Ret_te = Returns.loc[te_start_date:te_end_date]

lookback_days = 780
alpha = 0.95

# Calculate portfolio returns for test set
portfolio_returns_te = Ret_te.fillna(0.0).iloc[-lookback_days:].dot(weights)

# Calculate VaR and CVaR
portfolio_VaR_te = value_at_risk(value_invested, Ret_te, weights, alpha=alpha)
portfolio_VaR_return_te = portfolio_VaR_te / value_invested
portfolio_CVaR_te = cvar(value_invested, Ret_te, weights, alpha=alpha)
portfolio_CVaR_return_te = portfolio_CVaR_te / value_invested
[*********************100%%**********************]  15 of 15 completed
InĀ [875]:
print(f"Portfolio VaR Return (95%): {portfolio_VaR_return_te * 100:.2f}%")
print(f"Portfolio VaR Amount (95%): ${portfolio_VaR_te:.2f}")
print(f"Portfolio CVaR Return (95%): {portfolio_CVaR_return_te * 100:.2f}%")
print(f"Portfolio CVaR Amount (95%): ${portfolio_CVaR_te:.2f}")

plt.hist(portfolio_returns_te[portfolio_returns_te > portfolio_VaR_return_te] * 100, bins=30, color='green')
plt.hist(portfolio_returns_te[portfolio_returns_te < portfolio_VaR_return_te] * 100, bins=15, color='orange')
plt.axvline(portfolio_VaR_return_te * 100, color='blue', linestyle='solid')
plt.axvline(portfolio_CVaR_return_te * 100, color='red', linestyle='dashed')
plt.legend(['VaR for Specified Alpha as a Return',
            'CVaR for Specified Alpha as a Return',
            'Historical Returns Distribution',
            'Returns < VaR'])
plt.title('Historical 95% VaR and CVaR')
plt.xlabel('Return in %')
plt.ylabel('Observation Frequency')
plt.show()
Portfolio VaR Return (95%): -1.66%
Portfolio VaR Amount (95%): $-16648.59
Portfolio CVaR Return (95%): -2.18%
Portfolio CVaR Amount (95%): $-21823.51
No description has been provided for this image

Volatility, Hedged Portfolio PnL¶

For risk management and further optimizing our portfolio, we remove MPC from the portfolio as its the only filtered stock from the Energy sector in the basket. Then we calculate residuals and betas of stocks within the two remaining sectors (Technology and Finance) relative to the overall market and their respective sector benchmarks.

  • The covariance matrix of the residuals is calculated using the Ledoit-Wolf shrinkage
  • The average pairwise correlation of our portfolio is only 0.0118 indicating that the stocks within our portfolio have very low correlation and are diversified
  • The Beta of the technology stocks basket is 1.3582, beta of the finance stocks basket is 1.1040; and the overall portfolio beta is 1.2492
InĀ [946]:
start_date = '2019-01-01'
SectorHedgeBasket = ['ADBE', 'GS', 'MSFT', 'MA', 'ACN', 'CRM', 'BRK-B', 'AMD', 'AXP', 'AAPL', 'V', 'JPM', 'AVGO', 'NVDA', 'XLK', 'SPY', 'WFC']
prices_SHedgeB = fetch_data(SectorHedgeBasket, start_date)

# Calculate returns
rets = prices_SHedgeB.pct_change().dropna()
[*********************100%%**********************]  17 of 17 completed
InĀ [950]:
# Get market hedge ticker
mkt = ['SPY']
# Get sector hedge tickers
sector_1_hedge = ['XLK']  # Technology
sector_2_hedge = ['WFC']  # Finance

# Identify securities for each sector
sector_1_stocks = ['ADBE', 'MSFT', 'ACN', 'CRM', 'AMD', 'AAPL', 'AVGO', 'NVDA']  # Technology stocks
sector_2_stocks = ['GS', 'MA', 'BRK-B', 'AXP', 'V', 'JPM']  # Finance stocks
InĀ [952]:
# Extract returns
market = rets[mkt]
sector_1_rets = rets[sector_1_hedge]
sector_2_rets = rets[sector_2_hedge]

stock_rets = rets.drop(['XLK', 'SPY', 'WFC'], axis=1)
residuals_market = stock_rets.copy() * 0
residuals = stock_rets.copy() * 0
betas = {}

# Calculate market beta of sector 1 benchmark
model = sm.OLS(sector_1_rets.values, market.values)
results = model.fit()
sector_1_excess = results.resid

# Calculate market beta of sector 2 benchmark
model = sm.OLS(sector_2_rets.values, market.values)
results = model.fit()
sector_2_excess = results.resid
InĀ [954]:
# Calculate residuals and betas for sector 1 stocks
for stock in sector_1_stocks:
    model = sm.OLS(stock_rets[stock], market.values)
    results = model.fit()
    betas[stock] = results.params[0]
    residuals_market[stock] = results.resid

    model = sm.OLS(residuals_market[stock], sector_1_excess)
    results = model.fit()
    residuals[stock] = results.resid

# Calculate residuals and betas for sector 2 stocks
for stock in sector_2_stocks:
    model = sm.OLS(stock_rets[stock], market.values)
    results = model.fit()
    betas[stock] = results.params[0]
    residuals_market[stock] = results.resid

    model = sm.OLS(residuals_market[stock], sector_2_excess)
    results = model.fit()
    residuals[stock] = results.resid

# Get covariance of residuals
lw_cov = LedoitWolf().fit(residuals).covariance_
InĀ [956]:
# Extract correlation from covariance
def extract_corr_from_cov(cov):
    d = np.sqrt(np.diag(cov))
    corr = cov / np.outer(d, d)
    corr[cov == 0] = 0
    return corr

# Create cumulative returns
Cumret2 = 100 * (1 + residuals).cumprod()

# Plot cumulative returns and correlation heatmap
fig, (ax1, ax2) = plt.subplots(ncols=2, figsize=(14, 7))
fig.tight_layout(pad=3.0)

# Plot cumulative returns
Cumret2.plot(ax=ax1)
ax1.set_ylabel("Cumulative Return (%)")
ax1.set_title("Cumulative Returns")

# Plot heatmap of correlations
corr = extract_corr_from_cov(lw_cov)
sns.heatmap(corr, ax=ax2, fmt='d', vmin=-1, vmax=1, xticklabels=stock_rets.columns, yticklabels=stock_rets.columns)
ax2.set_title("Correlation Heatmap")

plt.show()
No description has been provided for this image
InĀ [958]:
# Calculate average pairwise correlation
average_corr = np.mean(corr[np.triu_indices_from(corr, k=1)])
print('Average pairwise correlation: %.4f' % average_corr)
Average pairwise correlation: 0.0118
InĀ [1344]:
# Calculate and plot PnL for the hedged portfolio
weights = np.ones(len(stock_rets.columns)) / len(stock_rets.columns)
portfolio_returns = (residuals @ weights).cumsum()
initial_investment = 100000
pnl = initial_investment * (1 + portfolio_returns)

plt.figure(figsize=(14, 7))
pnl.plot(title='Hedged Portfolio PnL')
plt.xlabel('Time')
plt.ylabel('PnL ($)')
plt.show()
No description has been provided for this image
InĀ [964]:
# Calculate the average beta for each sector
avg_beta_sector_1 = np.mean([betas[stock] for stock in sector_1_stocks])
avg_beta_sector_2 = np.mean([betas[stock] for stock in sector_2_stocks])

print(f'Beta of the Technology Stocks Basket: {avg_beta_sector_1:.4f}')
print(f'Beta of the Finance Stocks Basket: {avg_beta_sector_2:.4f}')

# Calculate the overall portfolio beta
portfolio_beta = sum(weights[i] * betas[stock] for i, stock in enumerate(stock_rets.columns))
print(f'Overall Portfolio Beta: {portfolio_beta:.4f}')
Beta of the Technology Stocks Basket: 1.3582
Beta of the Finance Stocks Basket: 1.1040
Overall Portfolio Beta: 1.2492